ABSTRACT

Weighted  average  derivatives  are  useful  parameters  for  semiparametric  index  models, 
including  limited  dependent  variables,   and  in  nonparametric  demand  estimation. 
Efficiency of weighted average derivative estimators is a concern, because the weight may
affect efficiency and the presence of a nonparametric function estimator might lead to
low efficiency. The purpose of this paper is to give efficiency results for average
derivative estimators, including formulating estimators that have high efficiency.

Our  efficiency  analysis  proceeds  by  deriving  the  semiparametric  efficiency  bounds 
for  average  derivative  and  index  models,  and  then  formulating  average  derivative 
estimators  with  high  efficiency.     We  first  derive  the  bound  for  weighted  average 
derivatives  of  conditional  location  functionals,  such  as  the  conditional  mean  and  median. 
This bound gives the asymptotic distribution of any sufficiently well-behaved weighted
average derivative estimator. We compare efficiency for different location measures,
e.g.  mean  versus  median,  finding  that  the  comparison  is  similar  to  the  location  model. 

Many  semiparametric  limited  dependent  variable  and  regression  models  take  the  form 
of  index  models,  where  the  location  measure  (e.g.   conditional  mean)  depends  only  on 
linear  combinations  of  the  regressors,  i.e.  on  "indices."     We  derive  the  semiparametric 
efficiency  bound  for  these  index  models.     We  then  compare  the  bound  with  the  asymptotic 
variance  of  weighted  average  derivative  estimators  of  the  index  coefficients.     We  derive 
the  form  of  an  efficient  weight  function  when  the  distribution  of  the  regressors  is 
elliptically  symmetric  and  discuss  existence  of  optimal  weights.     We  discuss  combining 
different  weights  to  achieve  efficiency,   in  the  process  deriving  a  general  condition  for 
approximate  efficiency  of  a  pooled  minimum  chi-square  estimator  in  a  semiparametric 
model.     Also,  we  discuss  ways  the  type  and  number  of  weights  could  be  selected  to  achieve 
high  efficiency  in  practice. 

Keywords:     average  derivative,   index  model,  efficiency  bound,   optimal  weights,  minimum 
chi-square,  spanning  condition. 


1.        Introduction 

Average  derivatives  are  useful  parameters  in  a  number  of  semiparametric  models.     As 
discussed  in  Stoker  (1986),   they  can  be  used  in  estimation  of  index  models,   including 
limited  dependent  variable  models  and  partially  linear  regression  models.     Also,  they  are 
used in nonparametric demand estimation, as in Härdle, Hildenbrand, and Jerison (1991).
Efficiency  of  average  derivative  estimators  is  a  concern,  because  there  are  several  types 
that  have  been  proposed.     Also,  the  presence  of  a  nonparametric  function  estimator  might 
lead to low efficiency. The purpose of this paper is to give efficiency results for
average  derivative  estimators,   including  formulating  estimators  that  have  high 
efficiency. 

Our  efficiency  analysis  proceeds  by  deriving  the  semiparametric  efficiency  bounds 
for  average  derivative  and  index  models,  and  then  formulating  average  derivative 
estimators  with  high  efficiency.     We  first  derive  the  bound  for  weighted  average 
derivatives  of  conditional  location  functionals,  such  as  the  conditional  mean  and  median. 
For the conditional mean the previously suggested estimators of Härdle and Stoker (1989)
and  Stoker  (1991a)  attain  this  bound.     This  result  is  what  one  would  anticipate,  because 
this  bound  places  no  restrictions  on  the  data  distribution,  and  in  such  cases  any 
estimator  that  is  asymptotically  equivalent  to  a  sample  average  and  sufficiently  regular 
will  be  efficient  (e.g.   see  Newey,  1990a).     We  also  give  the  bound  for  other  location 
functionals  such  as  the  median.     We  find  that  when  the  efficiency  of  average  derivative 
estimators  for  different  location  functionals  can  be  compared,  the  comparison  is  similar 
to that for location models, e.g. with the average derivative of the conditional median being
more efficient than that of the conditional mean for "fat-tailed" distributions.

Many  semiparametric  limited  dependent  variable  and  regression  models  take  the  form 
of  index  models,   where  the  location  measure  (e.g.   conditional  mean)  depends  only  on 
linear  combinations  of  the  regressors,   i.e.   on  "indices."     We  derive  the  semiparametric 
efficiency  bound  for  these  index  models.     We  then  compare  the  bound  with  the  asymptotic 
variance of weighted average derivative estimators of the index coefficients.

In  an  index  model,  consistent  estimators  arise  from  the  use  of  different  weighting 
functions,  so  an  important  efficiency  question  is  the  choice  of  weights.     We  derive  the 
form  of  an  efficient  weight  function  when  the  distribution  of  the  regressors  is 
elliptically  symmetric.     It  is  also  shown  that  linearity  of  certain  conditional 
expectations  is  necessary  for  existence  of  an  efficient  weight.     We  discuss  the 
possibility  of  combining  different  weights  to  achieve  efficiency,  showing  that  it  is 
possible  to  obtain  an  approximately  efficient  estimator  by  pooling  using  minimum 
chi-square.     We  give  general  results  on  when  such  pooling  will  lead  to  efficiency  in 
semiparametric  models  and  on  how  the  number  of  estimators  to  combine  can  be  chosen  from 
the  data.       These  results  are  specialized  to  derive  conditions  for  achieving  approximate 
efficiency  from  combining  many  weighted  average  derivative  estimators.     Also,  we  suggest 
ways  of  combining  a  few  weighted  average  derivative  estimators  so  as  to  achieve  high 
efficiency. 

Other papers give some efficiency results on average derivatives or index models.
Chamberlain (1987) previously derived the semiparametric efficiency bound for conditional
mean, single index models.¹ Following our initial work, Samarov (1990) gave the
efficiency bound for the (unweighted) average derivative of the conditional mean. Newey
(1991a) gives efficiency bounds for linear functionals of mean-square projections (which
include average derivatives of conditional means as a special case). None of these
papers gives regularity conditions for the bounds.² Recently, Hall and Ichimura (1991)
derived some efficiency results for index estimators when there is a residual that is
independent of the regressors.

¹ We cite the working paper Chamberlain (1987) because the index model bound does not
appear in the published version.

² To be precise, they do not exhibit a sequence of regular parametric submodels for which
the Cramer-Rao bound for the submodel approximates the candidates for the bound they suggest.

2.        Weighted Average Derivatives and Partial Index Models

Let $y$ denote a dependent variable, $x$ a $k \times 1$ vector of regressors, $\rho(\varepsilon)$ a
loss function of a real-valued variable, and

(2.1)  $g(x) = \arg\min_{g} E[\rho(y - g) \mid x]$.

Here $g(x)$ is a conditional location function. Examples include the conditional
mean for $\rho(\varepsilon) = \varepsilon^2$, the conditional median for $\rho(\varepsilon) = |\varepsilon|$, as well as other more
exotic location functionals such as quantiles or expectiles. Our interest is in
properties of estimators of weighted average derivatives of $g(x)$. Partition $x$ as
$x = (x_1^T, x_2^T)^T$, suppose $x_1$ is continuously distributed, and for a function $a(x)$ let
$a'(x) = \partial a(x)/\partial x_1$. A weighted average derivative of $g(x)$ is

(2.2)  $\delta = E[w(x) g'(x)]$,

where $w(x)$ is a scalar function. For the average derivative to be well defined, $x_1$
must be continuously distributed, but $x_2$ can be discrete.

A primary motivation for weighted average derivatives is a partial index model, with

(2.3)  $g(x) = G(x_1^T\beta, x_2)$.

Under this model, $\delta = E[w(x)\,\partial G(x_1^T\beta, x_2)/\partial(x_1^T\beta)] \cdot \beta$, so that the weighted average
derivative is proportional to $\beta$, and hence can be used to estimate $\beta$ up to scale.
Rodriguez and Stoker (1992) have recently used this model in specification analysis for
estimation of conditional means.
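
The proportionality of the weighted average derivative to the index coefficients can be
illustrated numerically. The following is a minimal sketch, not an estimator from this
paper: it assumes a pure single index model (no $x_2$), a known link $G = \tanh$, and an
arbitrary weight function, all of which are hypothetical choices made only for illustration.

```python
import numpy as np

def weighted_average_derivative(grad_g, w, x):
    """Sample analogue of delta = E[w(x) g'(x)] when the gradient g' is known."""
    return (w(x)[:, None] * grad_g(x)).mean(axis=0)

# Hypothetical single-index example: g(x) = G(x'beta) with G = tanh.
beta = np.array([1.0, 2.0])
grad_g = lambda x: (1.0 - np.tanh(x @ beta) ** 2)[:, None] * beta   # G'(x'beta) * beta
w = lambda x: np.exp(-0.5 * (x ** 2).sum(axis=1))                   # illustrative weight

rng = np.random.default_rng(0)
x = rng.normal(size=(5000, 2))
delta = weighted_average_derivative(grad_g, w, x)
ratio = delta[1] / delta[0]   # matches beta[1] / beta[0]: delta identifies beta up to scale
```

Because every gradient is a scalar multiple of `beta`, any weighting leaves the direction
of `delta` unchanged; only its scale depends on `w`.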

The motivation for the index model where $g(x)$ is the conditional mean and $x_2$ is
not present is well known (e.g. see Stoker, 1986). Other cases can be motivated by a
variety of semiparametric regression models. In particular, suppose that

(2.4)  $y \mid x = y \mid (x_1^T\beta, x_2)$,

a conditional distribution index model that is analyzed in Newey (1990b). This model
allows for $y$ to have a conditional mean and variance (and other moments) that depend on
$x_2$ in an arbitrary way. It is implied by many interesting more restrictive models. For
instance, it is implied by $y = \tau(x_1^T\beta + \mu(x_2) + \sigma(x_2)v)$, where $v$ is independent of $x$
and $\tau$ is a transformation that can be either known or unknown. The transformation
$\tau(r)$ could be $\tau(r) = 1(r > 0)$, corresponding to a binary choice model that allows for
heteroskedasticity to depend on $x_2$ in an arbitrary way, or it could be $\tau(r) = \xi^{-1}(r)$,
corresponding to a model $\xi(y) = x_1^T\beta + \eta$ that is important for duration data. Equation
(2.4) then implies that equation (2.3) is satisfied, so that partial index models are implied by
transformed, semiparametric regression models that allow heteroskedasticity to depend
nonparametrically on some regressors.

For the semiparametric model in equation (2.4), the choice of loss function $\rho(\varepsilon)$
can be motivated by efficiency considerations similar to those for the linear model,
namely that if the distribution of $y$ is fat-tailed, then a more efficient estimator
might be obtained by working with a $\rho(\varepsilon)$ that gives less weight to large values of $|\varepsilon|$.
This feature will become apparent from the semiparametric efficiency bounds derived
below. Also, comparison of parameters from different choices of $\rho(\varepsilon)$ may allow one to
test restrictions on the conditional distribution of $y$ given $x$, similarly to Koenker
and Bassett (1982).

In some cases it is possible to weaken equation (2.4) so that essentially only one
choice of $\rho(\varepsilon)$ will produce a partial index model. An important such case is where

(2.5)  $y = \tau(x_1^T\beta + \mu(x_2) + v,\ x_2)$,  $\tau(r, x_2)$ is monotonic in $r$,  $\mathrm{median}(v \mid x) = 0$.

Because the median of a monotonic transformation is the transformation of the median, this
will be a partial index model for the conditional median, where $\rho(\varepsilon) = |\varepsilon|$, but not for
other conditional location measures. This model is a generalization of one considered by
Powell (1991). For the case where $x_2$ is not present, Doksum and Samarov (1992) have
suggested using average derivative estimators to estimate $\tau(v)$ and its inverse when
$\tau(v)$ is monotonic.

The  primary  purpose  of  this  paper  is  to  develop  the  efficiency  properties  of  weighted 
average  derivative  and  partial  index  estimators,  and  discuss  how  and  when  different 
weighted  average  derivative  estimators  can  be  combined  into  an  approximately  efficient 
estimator  of  a  partial  index  model.     It  is  beyond  the  scope  of  this  paper  to  discuss  the 
properties  of  particular  estimators,   although  the  results  here  have  implications  for  the 
asymptotic  properties  of  average  derivative  estimators.     As  discussed  in  Newey  (1991a), 
the  semiparametric  efficiency  bound  for  an  unrestricted  functional,   such  as  an  average 
derivative, is the asymptotic variance of any sufficiently regular estimator. Thus, one
would expect the asymptotic variance of average derivative estimators to have the form
given below.

For instance, consider the following kernel estimator that is well defined even when
$\rho(\varepsilon)$ is not smooth (e.g. for $\rho(\varepsilon) = |\varepsilon|$). Let $f(x)$ be the density of $x$. By
integration by parts,

(2.6)  $\delta = E[\ell(x) g(x)]$,  $\ell(x) = -w'(x) - w(x) f'(x)/f(x)$.
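
The integration-by-parts step behind equation (2.6) can be written out as follows. This
is a sketch that assumes $w(x)f(x)$ vanishes on the boundary of the support of $x$ (the
condition imposed as Assumption 3.1 in Section 3), that $f'(x)$ denotes
$\partial f(x)/\partial x_1$, and that integrals over any discrete components of $x_2$ are read as sums.

```latex
\begin{aligned}
\delta = E[w(x) g'(x)]
  &= \int w(x) f(x)\, \frac{\partial g(x)}{\partial x_1}\, dx \\
  &= -\int g(x)\, \frac{\partial [w(x) f(x)]}{\partial x_1}\, dx
     \qquad \text{(boundary terms vanish)} \\
  &= \int g(x) \left[ -w'(x) - w(x) \frac{f'(x)}{f(x)} \right] f(x)\, dx
   = E[\ell(x) g(x)] .
\end{aligned}
```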

For a kernel $K(v)$ let $K_h(x) = h^{-k}K(x/h)$, let $\hat f(x) = \sum_{i=1}^n K_h(x - x_i)/n$ be a kernel
density estimator, and let $\hat g(x) = \arg\min_g \sum_{i=1}^n K_h(x - x_i)\rho(y_i - g)/n$ be the kernel estimator
of Tsybakov (1982). Then an estimator of $\delta$ corresponding to equation (2.6) is

(2.7)  $\hat\delta = \left(\sum_{i=1}^n w(x_i)/n\right)\left[\sum_{i=1}^n \hat\ell(x_i) x_{1i}^T/n\right]^{-1} \sum_{i=1}^n \hat\ell(x_i)\hat g(x_i)/n$,

       $\hat\ell(x_i) = -w'(x_i) - w(x_i)\hat f'(x_i)/\hat f(x_i)$.

2 
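For $w(x) = 1$ and $\rho(\varepsilon) = \varepsilon^2$, this estimator is similar to those analyzed in Stoker
(1991a).³

³ The two leading terms form a nonparametric estimator of the identity matrix, and so do
not contribute to the asymptotic variance, but they lead to a reduction in a severe
finite-sample bias in the estimator, as discussed in Stoker (1991b).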
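
The kernel construction above can be sketched in code for the simplest case. The block
below is an illustration under stated assumptions rather than the estimator analyzed in
the cited papers: it takes scalar $x$, $w(x) = 1$, squared loss (so that $\hat g$ is a
kernel regression), a Gaussian kernel, and a fixed bandwidth `h` chosen by hand. The
function name and simulated data are hypothetical.

```python
import numpy as np

def kernel_average_derivative(x, y, h):
    """Sketch of (2.7) for scalar x, w(x) = 1 and rho(e) = e**2 (Gaussian kernel)."""
    d = x[:, None] - x[None, :]                    # d[i, j] = x_i - x_j
    K = np.exp(-0.5 * (d / h) ** 2)                # kernel weights; constants cancel below
    f_hat = K.sum(axis=1)                          # proportional to f_hat(x_i)
    f_prime_hat = (-(d / h ** 2) * K).sum(axis=1)  # proportional to f_hat'(x_i)
    ell_hat = -f_prime_hat / f_hat                 # -w' - w f'/f with w = 1
    g_hat = (K @ y) / f_hat                        # kernel regression: argmin for squared loss
    # Ratio form: the denominator estimates E[ell(x) x] = 1 and reduces smoothing bias.
    return np.mean(ell_hat * g_hat) / np.mean(ell_hat * x)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + 0.5 * rng.normal(size=1000)          # conditional mean 2x, so delta = 2
delta_hat = kernel_average_derivative(x, y, h=0.3)
```

No bandwidth selection or bias correction is attempted here, so the estimate should be
read as being in the neighborhood of 2 rather than as a precise value.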


An alternative approach is to use a series estimator with smooth approximating
functions. Let $P^K(x)$ be a $K \times 1$ vector of differentiable functions. Suppose that
linear combinations of this vector can approximate functions and their derivatives.
Consider the estimator

(2.8)  $\hat\delta = \sum_{i=1}^n w(x_i)\hat g'(x_i)/n$,  $\hat g(x) = \hat\pi^T P^K(x)$,  $\hat\pi = \arg\min_\pi \sum_{i=1}^n \rho(y_i - \pi^T P^K(x_i))$.

This estimator is based on differentiating the series estimator $\hat g(x)$ of $g(x)$. For the
conditional mean case this estimator is analyzed in Newey (1991a).
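
A minimal sketch of the series approach for the conditional mean case follows. It
assumes scalar $x$, a power-series basis for $P^K(x)$, and squared loss, so that
$\hat\pi$ is ordinary least squares; the basis, function name, and simulated design are
illustrative assumptions, not choices made in the paper.

```python
import numpy as np

def series_average_derivative(x, y, degree, w=None):
    """Sketch of (2.8) for scalar x and squared loss, using a power-series basis."""
    w = np.ones_like(x) if w is None else w
    P = np.vander(x, degree + 1, increasing=True)    # P[i, j] = x_i ** j
    pi_hat, *_ = np.linalg.lstsq(P, y, rcond=None)   # least squares = argmin for rho(e) = e^2
    j = np.arange(1, degree + 1)
    dP = j * x[:, None] ** (j - 1)                   # derivatives of x ** j for j >= 1
    g_prime = dP @ pi_hat[1:]                        # g_hat'(x_i)
    return np.mean(w * g_prime)

rng = np.random.default_rng(1)
x = rng.normal(size=2000)
y = 2.0 * x + x ** 2 + 0.1 * rng.normal(size=2000)   # g(x) = 2x + x^2, so g'(x) = 2 + 2x
delta_hat = series_average_derivative(x, y, degree=3)
target = np.mean(2.0 + 2.0 * x)                      # sample average of the true derivative
```

For a non-smooth loss such as $\rho(\varepsilon) = |\varepsilon|$ the least-squares step would be replaced
by the corresponding minimization, which is not shown here.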

Regularity  conditions  for  asymptotic  normality  of  these  estimators  are  beyond  the 
scope  of  this  paper.     Nevertheless,  the  results  of  this  paper  do  provide  a  formula  for 
the  asymptotic  variance  of  these  estimators.     As  discussed  in  Section  3,  the 
semiparametric  bound  we  derive  will  be  the  asymptotic  variance  of  any  estimator  that  is 
asymptotically  equivalent  to  a  sample  average  and  sufficiently  regular. 


3.        Efficiency  for  Weighted  Average  Derivative  Estimation 

In  this  Section  we  derive  the  semiparametric  efficiency  bound  for  weighted  average 
derivative  estimators.     The  average  derivative  is  an  unrestricted  parameter,  in  that  its 
definition  places  no  substantive  restrictions  on  the  data  distribution.     The  efficiency 
bound  for  estimators  of  such  unrestricted  parameters  can  be  calculated  as  the  variance  of 
the  pathwise  derivative  of  the  parameter  with  respect  to  the  distribution  of  the  data,   as 
shown  in  Pfanzagl  and  Wefelmeyer  (1982). 

To discuss the pathwise derivative it is useful to introduce more terminology. Let
$z = (y, x^T)$ denote a single observation and $f(z)$ the true density of $z$ (with respect
to some dominating measure). A "parametric submodel" or "path" is a parametric family of
densities $f(z \mid \theta)$ (with respect to some dominating measure) that pass through the truth,
i.e. such that $f(z \mid \theta_0) = f(z)$ for some parameter value $\theta_0$. Let $\delta(\theta)$ denote the
value of the average derivative in equation (2.2) when $z$ is distributed as $f(z \mid \theta)$,
and let $S_\theta(z)$ denote the score of $f(z \mid \theta)$ at $\theta = \theta_0$, where typically $S_\theta(z) =
\partial \ln f(z \mid \theta)/\partial\theta|_{\theta=\theta_0}$ but $S_\theta(z)$ is precisely defined, e.g., in Newey (1990a). The
pathwise derivative is a function $\psi(z)$ with finite mean-square (i.e. $E[\psi(z)^T\psi(z)] < \infty$)
such that $E[\psi(z)] = 0$ and for any sufficiently regular submodel,

(3.1)  $\partial\delta(\theta)/\partial\theta|_{\theta=\theta_0} = E[\psi(z)S_\theta(z)^T]$.

When the distribution of $z$ is unrestricted, as it is here, it is typically possible to
show that a parametric submodel can be chosen so that the score $S_\theta(z)$ approximates any
function of $z$ with mean zero and finite mean-square. In that case, a lower bound on the
asymptotic variance of semiparametric estimators of $\delta$ is

(3.2)  $V = E[\psi(z)\psi(z)^T]$.


Briefly, the idea behind this formula is that by the reasoning of Stein (1956), $V$
should be the supremum of Cramer-Rao bounds over all parametric submodels. The
Cramer-Rao bound for a submodel is $E[\psi(z)S_\theta(z)^T](E[S_\theta(z)S_\theta(z)^T])^{-1}E[S_\theta(z)\psi(z)^T]$ by
equation (3.1) and the "delta-method." This matrix is bounded above by $V$ and it is
approximately equal to $V$ when $S_\theta(z)$ is approximately equal to $\psi(z)$.
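
One way to see why the Cramer-Rao bound of a submodel cannot exceed $V$ is a
least-squares projection argument. The residual $r(z)$ below is notation introduced only
for this sketch, and arguments are suppressed inside expectations.

```latex
\begin{aligned}
r(z) &= \psi(z) - E[\psi S_\theta^T]\,(E[S_\theta S_\theta^T])^{-1} S_\theta(z), \\
E[r r^T] &= E[\psi\psi^T] - E[\psi S_\theta^T]\,(E[S_\theta S_\theta^T])^{-1} E[S_\theta \psi^T]
          = V - (\text{Cramer-Rao bound}) \succeq 0 .
\end{aligned}
```

Since $E[rr^T]$ is positive semidefinite, the bound for any submodel is no larger than
$V$, with near equality when linear combinations of the score nearly reproduce $\psi(z)$.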

Another interpretation of $\psi(z)$ is as the influence function of an efficient
estimator $\hat\delta$, satisfying

(3.3)  $\sqrt{n}(\hat\delta - \delta) = \sum_{i=1}^n \psi(z_i)/\sqrt{n} + o_p(1)$.


The "influence function" terminology, which originated in the robust estimation
literature, is motivated by the fact that in large samples $\psi(z)$ approximately gives the
effect of a single observation on $\hat\delta$. It is a convenient way to think about the
asymptotic properties of $\sqrt{n}$-consistent estimators, because most will satisfy equation
(3.3) for some influence function (e.g. mean average derivative estimators, as shown in
Stoker, 1991a). By the central limit theorem the asymptotic variance of $\hat\delta$ will be
$E[\psi(z)\psi(z)^T]$, equal to the bound when the influence function equals the pathwise
derivative in equation (3.2).

In general, any influence function will be a pathwise derivative, i.e. if an
estimator satisfies equation (3.3) for some $\psi(z)$, and certain regularity conditions
hold, then its influence function satisfies equation (3.1) (e.g. see Newey, 1990a). When
the model imposes no substantive restrictions on the data distribution, the set of scores
is unrestricted (except for having mean zero), as it is in this Section. Therefore,⁴
there can be at most one pathwise derivative, and hence at most one influence function.
As discussed in Newey (1991a), this fact can be used to find the influence function of
any estimator, by calculating the pathwise derivative of the parameter that is estimated
under unrestricted distributions. In particular, the influence function of an average
derivative estimator will be equal to the pathwise derivative derived below, because we
impose no substantive restrictions on the data distribution. Thus, we are justified in
interpreting $V$ not only as the variance bound for average derivative estimators but
also as the asymptotic variance of any (regular) average derivative estimator satisfying
equation (3.3). For instance, the bound derived below will be the asymptotic variance of
the kernel and series estimators suggested at the end of Section 2, under the plausible
assumptions that they satisfy equation (3.3) and are regular.

⁴ For two different influence functions $\psi(z)$ and $\tilde\psi(z)$, choosing $S_\theta(z)$
(approximately) equal to $\tilde\psi(z) - \psi(z)$ and differencing equation (3.1) implies $0 =
E[\{\tilde\psi(z) - \psi(z)\}^T S_\theta(z)] = E[\{\tilde\psi(z) - \psi(z)\}^T\{\tilde\psi(z) - \psi(z)\}]$.

Because the form of the pathwise derivative is the main result of this Section, we
will first give this form and discuss it, postponing the regularity conditions until the
end of the Section. Let $\varepsilon = y - g(x)$ and

(3.4)  $u = -\nu(x)^{-1}m(\varepsilon)$,  $m(\varepsilon) = d\rho(\varepsilon)/d\varepsilon$,  $\nu(x) = \partial E[m(y - g) \mid x]/\partial g|_{g=g(x)}$.

Theorem 3.1 below shows that the pathwise derivative is

(3.5)  $\psi(z) = w(x)g'(x) - \delta + \ell(x)u = w(x)g'(x) - \delta - \ell(x)\nu(x)^{-1}m(y - g(x))$,

where $\ell(x)$ is given in equation (2.6). The semiparametric variance bound is then
$E[\psi(z)\psi(z)^T]$.

Two interesting special cases are the conditional mean and median. For the mean,
$u = \varepsilon$, so that

(3.6)  $\psi(z) = w(x)g'(x) - \delta + \ell(x)[y - g(x)]$.

For the median, $u = [2f(0 \mid x)]^{-1}\mathrm{sgn}(\varepsilon)$, where $f(0 \mid x)$ is the conditional density of $\varepsilon$
given $x$ at $\varepsilon = 0$ and $\mathrm{sgn}(\varepsilon) = 1(\varepsilon > 0) - 1(\varepsilon < 0)$, so that

(3.7)  $\psi(z) = w(x)g'(x) - \delta + \ell(x)[2f(0 \mid x)]^{-1}\mathrm{sgn}(y - g(x))$.
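
Both special cases can be obtained by evaluating equation (3.4) directly. The sketch
below assumes, for the median case, that $y$ has a continuous conditional distribution
function $F(\cdot \mid x)$ with density $f_y(\cdot \mid x)$; these two symbols are introduced here only
for illustration.

```latex
\begin{aligned}
\text{Mean: } & \rho(\varepsilon) = \varepsilon^2,\quad m(\varepsilon) = 2\varepsilon,\quad
  \nu(x) = \frac{\partial}{\partial g} E[2(y - g) \mid x] = -2,\quad
  u = -\nu(x)^{-1} m(\varepsilon) = \varepsilon . \\
\text{Median: } & \rho(\varepsilon) = |\varepsilon|,\quad m(\varepsilon) = \mathrm{sgn}(\varepsilon),\quad
  E[\mathrm{sgn}(y - g) \mid x] = 1 - 2F(g \mid x), \\
& \nu(x) = -2 f_y(g(x) \mid x) = -2 f(0 \mid x),\quad
  u = [2 f(0 \mid x)]^{-1}\, \mathrm{sgn}(\varepsilon) .
\end{aligned}
```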

In general, the variance bound can be decomposed into two terms.⁵ By $E[u \mid x] = 0$,

(3.8)  $E[\psi(z)\psi(z)^T] = \mathrm{Var}(w(x)g'(x)) + E[u^2\ell(x)\ell(x)^T]$.
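
The decomposition in equation (3.8) can be checked term by term from equation (3.5). In
this sketch $b(x)$ is shorthand introduced only here, and the cross terms drop out
because $E[u \mid x] = 0$.

```latex
\begin{aligned}
\psi(z) &= b(x) + \ell(x) u, \qquad b(x) = w(x) g'(x) - \delta, \\
E[\psi\psi^T] &= E[b\, b^T] + E\big[b\, \ell^T E[u \mid x]\big] + E\big[\ell\, b^T E[u \mid x]\big]
                 + E[u^2 \ell \ell^T] \\
              &= \mathrm{Var}(w(x) g'(x)) + E[u^2 \ell(x) \ell(x)^T] .
\end{aligned}
```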

The first term is the asymptotic variance bound when $g(x)$ is known, being the
asymptotic variance of $\sum_{i=1}^n w(x_i)g'(x_i)/n$. The second piece is the bound when $f(x)$ is
known, by the following argument: $\psi(z)$ still satisfies equation (3.1) for more
restrictive parametric submodels where the distribution of $x$ is known; scores then satisfy
$E[S_\theta(z) \mid x] = 0$, because $\theta$ parameterizes the conditional distribution of $y$ given $x$,
so that $E[\psi(z)S_\theta(z)^T] = E[\ell(x)uS_\theta(z)^T]$. Also, $S_\theta(z)$ can approximate any conditional
mean zero function, and hence can approximate $\ell(x)u$. Thus, the argument for equation
(3.2) also applies here.

⁵ We thank Gary Chamberlain for suggesting the following interpretation.

The magnitude of the second piece, $E[u^2\ell(x)\ell(x)^T] = E[E[u^2 \mid x]\ell(x)\ell(x)^T]$, corresponding to
unknown $g$, depends on $\rho(\varepsilon)$ similarly to the way the asymptotic variance of location
estimators depends on the loss function. In particular, $\ell(x)$ does not depend on the
form of $\rho(\varepsilon)$, while $E[u^2 \mid x] = \nu(x)^{-2}E[m(\varepsilon)^2 \mid x]$ is equal to the bound for estimation of
the location parameter $\arg\min_\mu E[\rho(\varepsilon_i - \mu)]$ if $\varepsilon_i$ were i.i.d. with density $f(\varepsilon_i \mid x)$,
where $m(\varepsilon)$ and $\nu(x)$ are given in equation (3.4). Thus, when average derivatives for
different $\rho(\varepsilon)$ functions are equal, so that the corresponding bounds can be compared,
the comparisons can be carried out in a way similar to that for estimation of location
parameters. For example, when the conditional density of $\varepsilon$ given $x$ has "thick tails"
for most values of $x$, the second piece of the bound will tend to be smaller for the
conditional median than for the conditional mean.
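
The mean-versus-median comparison can be made concrete with two textbook error
distributions. The sketch below evaluates $E[u^2 \mid x]$ from the formulas above under the
assumption that the errors are homoskedastic, so the conditional quantity does not vary
with $x$; the normal and Laplace examples and the function names are illustrative choices,
not taken from the paper.

```python
import math

def second_piece_scale_mean(var_eps):
    """E[u^2 | x] for the conditional mean: u = eps, so this is Var(eps | x)."""
    return var_eps

def second_piece_scale_median(f0):
    """E[u^2 | x] for the conditional median: u = sgn(eps) / (2 f(0|x))."""
    return 1.0 / (4.0 * f0 ** 2)

# Standard normal errors: variance 1, density at zero 1/sqrt(2*pi).
normal_mean = second_piece_scale_mean(1.0)
normal_median = second_piece_scale_median(1.0 / math.sqrt(2.0 * math.pi))   # pi / 2

# Laplace errors with scale b = 1: variance 2 b^2 = 2, density at zero 1/(2b) = 0.5.
laplace_mean = second_piece_scale_mean(2.0)
laplace_median = second_piece_scale_median(0.5)                              # 1.0
```

With normal errors the mean has the smaller second piece, while with the thicker-tailed
Laplace errors the ordering reverses, in line with the location-model comparison.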

Turning now to the statement of regularity conditions, we first give an assumption
that is essential to the result.

Assumption  3.1:     w(x)f(x)     is  zero  on  the  boundary  of  the  support  of     x. 

This condition allows us to ignore boundary terms in the derivation of the bounds.
Without this assumption, $E[w(x)g'(x)]$ may include boundary terms that depend on $g(x)$
evaluated at particular points. For continuously distributed regressors, the value of a
conditional expectation at a point has an infinite variance bound, so that the average
derivative will not be $\sqrt{n}$-consistently estimable. For a simple example, suppose that
$x$ is a scalar that is uniformly distributed on $[0,1]$, $g(x)$ is the conditional mean,
and $w(x) = 1$. Then

(3.9)  $\delta = E[g'(x)] = \int_0^1 g'(x)\,dx = g(1) - g(0)$.

It is easy to show that the semiparametric variance bound is infinite in this case, which
is consistent with the well known fact that the value of a conditional expectation at a
particular point is not $\sqrt{n}$-consistently estimable.

One way to guarantee that this assumption holds in index models is to choose $w(x)$
to be zero outside the interior of the support of $x$. Of course, such a choice might be
in conflict with the efficient choice of weights discussed in Section 5.

Additional  regularity  conditions  are  useful  for  deriving  the  result. 

Assumption 3.2: The support $X$ of $x$ is convex and compact, there is a compact $\mathcal{G} \subset \mathbb{R}$
containing the closure of $g(X)$ in its interior such that $E[m(y - g) \mid x] = 0$ has a unique
solution at $g(x)$ for $g \in \mathcal{G}$, and $\mathrm{Prob}(\nu(x) \neq 0) = 1$. Also, $E[\|f'(x)/f(x)\|^2] < \infty$,
$E[\psi(z)\psi(z)^T]$ exists and is nonsingular, $w(x)$ and $w'(x)$ are bounded on $X$, the
conditional distribution of $y$ given $x$ has conditional density $f(y \mid x)$ such that
$f(y \mid x)^{1/2}$ is mean-square continuously differentiable in $x_1$ on $X$, and for any $\zeta(y,x)$
that is bounded and continuously differentiable in $x_1$ with bounded derivatives,
$E[\zeta(y,x) \mid x]$ is continuously differentiable in $x_1$ on $X$, and $E[m(y - g)\zeta(y,x) \mid x]$ is
continuously differentiable in $x_1$ and $g$ on $X \times \mathcal{G}$.


The last smoothness condition does not seem very primitive, but it is straightforward
to give sufficient conditions for particular $m(y - g)$. For example, for $m(\varepsilon) = \varepsilon$ it
will follow from the other assumptions and continuity of $E[y^2 \mid x]$, and for $m(\varepsilon) =
\mathrm{sgn}(\varepsilon)$ it will be implied by $f(y \mid x)$ being absolutely continuous with $f(y + \alpha \mid x)^{1/2}$
mean-square continuously differentiable in the scalar $\alpha$ and $x_1$.⁶ This assumption will
not be satisfied if $y$ is discrete and $m(\varepsilon)$ is not continuous over the range of
$y - g(x)$, because (for $\zeta(y,x) = 1$) $E[m(y - g) \mid x]$ will not be continuous in $g$. In
particular, it does not hold if $y$ is binary and $m(\varepsilon) = \mathrm{sgn}(\varepsilon)$.

As is well known, the class of estimators must be restricted to obtain an efficiency
bound result. We do this by restricting attention to estimators $\hat\delta$ that are regular,
meaning that for a class of regular parametric submodels the limiting distribution of
$\sqrt{n}(\hat\delta - \delta(\theta_n))$ does not depend on $\{\theta_n\}$ when $\sqrt{n}(\theta_n - \theta_0)$ is bounded and the data has
distribution $f(z \mid \theta_n)$ for each sample size $n$.

⁶ Assumption 3.2 follows in these cases by Lemmas C.2 and C.3 of Newey (1991b) and by
$E[m(y - g)\zeta(y,x) \mid x] = \int m(u)\zeta(u + g, x)f(u + g \mid x)\,du$ when $y$ is continuously distributed.


Theorem 3.1: If Assumptions 3.1 and 3.2 are satisfied, and $V = E[\psi(z)\psi(z)^T]$ is
nonsingular for $\psi(z)$ in equation (3.5), then $V$ is the supremum of Cramer-Rao bounds of
all regular parametric models such that $\partial\delta(\theta)/\partial\theta|_{\theta=\theta_0} = E[\psi(z)S_\theta(z)^T]$, and any estimator
$\hat\delta$ of $\delta$ that is regular satisfies $\sqrt{n}(\hat\delta - \delta_0) \overset{d}{\to} Z + U$, where $Z$ is distributed as
$N(0,V)$ and $U$ is independent of $Z$. Furthermore, if $\sqrt{n}(\hat\delta - \delta_0) = \sum_{i=1}^n \tilde\psi(z_i)/\sqrt{n} +
o_p(1)$, where $E[\tilde\psi(z_i)] = 0$ and $E[\tilde\psi(z_i)^T\tilde\psi(z_i)] < \infty$, and $\hat\delta$ is regular, then $\tilde\psi(z_i) =
\psi(z_i)$.


In the environment considered here, where the data distribution is not restricted, the
assumption that the pathwise derivative formula holds for a parametric submodel is just a
convenient regularity condition. It is verified in the proof of the theorem that there
exists a class of regular parametric models where the pathwise derivative formula is
satisfied, with score that can be chosen to approximate any mean-zero random vector with
finite mean-square.

The last conclusion of this theorem shows that any influence function must equal
that given in equation (3.5). In particular, the series and kernel estimators described
in equations (2.7) and (2.8) will have $\psi(z)$ as their influence function, and hence
asymptotic variance $E[\psi(z)\psi(z)^T]$, as long as they satisfy equation (3.3) (for some
$\tilde\psi(z)$) and are sufficiently regular.


4.        Efficiency  Bounds  for  Multiple  Index  Models 

In this Section we derive the variance bound for the parameters of multiple index
models, where the function $g(x)$ described earlier is restricted to depend on a function
of $x$ and parameters. Let $v(x,\beta)$ be a vector of functions of $x$ and a $q \times 1$
parameter vector $\beta$. A multiple index model is one where there is a function $G(v)$ such
that

(4.1)  $g(x) = G(v(x,\beta_0))$.


An important example is the partial index model discussed in Section 2, where $v(x,\beta) =
(x_1^T\beta, x_2^T)^T$. The pathwise derivative can be used to calculate the efficiency bound for
estimators of $\beta$, although this approach must be modified to account for the
restrictions imposed by equation (4.1). We now carry out this calculation, using tangent
set and projection methods.

Because $\beta$ is now an implicit parameter rather than an explicit functional, a more
specific parameterization of the problem is useful. Consider parameterizing the
submodels by $\theta = (\beta^T, \eta^T)^T$, where $\eta$ is a nuisance parameter vector for any feature of
the distribution of $z$ other than $\beta$. Let $S_\beta$ and $S_\eta$ denote the respective scores,
where for notational convenience we have suppressed the $z$ argument. Then equation
(3.1) for a pathwise derivative reduces to

(4.2)  $E[\psi(z)S_\beta^T] = I$,  $E[\psi(z)S_\eta^T] = 0$.

By the same reasoning following equation (3.2), the efficiency bound will be the variance of $\psi(z)$ such that this equation is satisfied and $\psi(z)$ can be approximated by a linear combination of $S_\theta$. This $\psi(z)$ can be found by a projection calculation. Let $\mathcal{T}$ be the mean-square closure of the union of all $q \times 1$ linear combinations of all possible nuisance scores, i.e. $\mathcal{T} = \{t(z) : \forall\, \epsilon > 0,\ \exists$ constant matrix $C$, $S_\eta$ with $E[\|t - CS_\eta\|^2] < \epsilon\}$, referred to as the tangent set. Assuming that $\mathcal{T}$ is a linear set, let $t_\beta$ be the mean square projection of $S_\beta$ on $\mathcal{T}$, which is characterized by the two conditions $t_\beta \in \mathcal{T}$ and $E[(S_\beta - t_\beta)^T t] = 0$ for all $t \in \mathcal{T}$, and let $S = S_\beta - t_\beta$. The random vector $S$ is referred to as the efficient score. Then $\psi(z) = (E[SS^T])^{-1}S$ will satisfy equation (3.1) and can be approximated by a linear combination of scores, so that the semiparametric variance bound for $\beta$ will be $(E[SS^T])^{-1}$. Begun, Hall, Huang, and Wellner (1983) and Bickel, Klaassen, Ritov, and Wellner (1992) developed this projection form of the bound.
The form of the efficient score is the main result of this Section, so we first present and discuss it, and then give regularity conditions. Let $v = v(x,\beta_0)$ and $v_\beta = [\partial v(x,\beta_0)^T/\partial\beta]\,\partial G(v)/\partial v$, where by convention each component of $\partial G(v)/\partial v$ is set equal to zero if the corresponding component of $v(x,\beta)$ does not depend on $\beta$. Theorem 4.1 shows that the efficient score is

(4.3)  $S = \sigma^{-2}(x)\,u\,\{v_\beta - E[\sigma^{-2}(x)v_\beta|v]/E[\sigma^{-2}(x)|v]\}$,  $\sigma^2(x) = E[u^2|x] = \nu(x)^{-2}E[m(\varepsilon)^2|x]$.

Thus, the semiparametric variance bound is

(4.4)  $V = (E[SS^T])^{-1} = \left(E\left[\sigma^{-2}(x)v_\beta v_\beta^T - E[\sigma^{-2}(x)v_\beta|v]\,E[\sigma^{-2}(x)v_\beta^T|v]/E[\sigma^{-2}(x)|v]\right]\right)^{-1}$.

Although the bound is complicated, it has a straightforward interpretation in terms of an optimally weighted m-estimator where the regression function $G(v)$ has been "concentrated out." Assuming that $\nu(x) > 0$, let $\omega(x) = \nu(x)^{-1}\sigma^{-2}(x) = 1/\{\nu(x)E[u^2|x]\}$, and

$G(v(x,\beta),\beta) = \arg\min_{g(v(x,\beta))} E[\omega(x)\rho(y - g)] = \arg\min_{g \in \mathbb{R}} E[\omega(x)\rho(y - g)\,|\,v(x,\beta)]$.

Thus, this function minimizes the population value of a weighted m-estimation criterion. Consider the estimator of $\beta$ that minimizes the sample counterpart to this criterion, with $G(v(x,\beta),\beta)$ substituted for $g$,

(4.5)  $\hat\beta = \arg\min_\beta \sum_{i=1}^n \omega(x_i)\rho(y_i - G(v(x_i,\beta),\beta))$.

This is an estimator where the unknown function $G(v)$ has been replaced by the function that minimizes the population counterpart, i.e. where $G(v)$ has been "concentrated out" in the population. By the usual formula for a parametric m-estimator, $\hat\beta$ will have asymptotic variance $(E[\sigma^{-2}(x)G_\beta(x)G_\beta(x)^T])^{-1}$ for $G_\beta(x) = \partial G(v(x,\beta),\beta)/\partial\beta\,|_{\beta=\beta_0}$. This asymptotic variance is the semiparametric bound (see footnote 7), because

(4.6)  $G_\beta(x) = v_\beta - E[\sigma^{-2}(x)v_\beta|v]/E[\sigma^{-2}(x)|v]$.

This interpretation suggests an approach to efficient estimation that proceeds by replacing $\omega(x)$ and $G(v(x,\beta),\beta)$ in equation (4.5) by nonparametric estimators. In the single index, conditional mean case, Ichimura's (1991) weighted kernel estimator that uses known $\omega(x)$ can be interpreted as an estimator of $G$ when $\omega(x) = \mathrm{Var}(\varepsilon|x)^{-1}$ is known. In general, the estimation of $G$ and $\omega(x)$ will not affect the asymptotic variance, so that this estimator will be efficient. In particular it follows by Proposition 2 of Newey (1991a) that the replacement of $G$ by a nonparametric estimator will not affect the asymptotic variance, essentially because $G$ has been "concentrated out." Also, as usual in m-estimation, estimation of $\omega(x)$ will not affect the asymptotic distribution under appropriate regularity conditions.

The first assumption gives regularity conditions for the distribution of the data as a function of $\beta$. Let $G(v(x,\beta),\beta)$ denote the location functional when $\beta$ is the true parameter and $\varepsilon(\beta) = y - G(v(x,\beta),\beta)$. The way that $G$ can depend on $\beta$ directly will be left unspecified, because it does not affect the form of the bound. Let $E_\beta[\cdot]$ and $E_\eta[\cdot]$ denote expectations for a parametric submodel when $\eta = \eta_0$ and $\beta = \beta_0$, respectively. Let $\mathcal{X}$ denote the support of $x$.

Assumption 4.1: i) The marginal distribution of $x$ does not depend on $\beta$; ii) $\beta \in \mathcal{B}$ for an open set $\mathcal{B}$ such that on $\mathcal{X} \times \mathcal{B}$, $v(x,\beta)$ is bounded and continuously differentiable in $\beta$ with bounded derivative, $G(v,\beta)$ is bounded and continuously differentiable in $\beta$ and $v$ with bounded derivatives, the conditional distribution of $y$ given $x$ at $\beta$ has density $f(y|x,\beta)$ that is regular in $\beta$ with probability one with (conditional) information matrix that is nonsingular and bounded, $E_\beta[m(\varepsilon(\beta))^2|x]$ is bounded and bounded away from zero; iii) there is a compact set $\mathcal{G}$ containing in its interior the closure of $G(v(\mathcal{X},\mathcal{B}),\mathcal{B})$ such that on $\mathcal{X} \times \mathcal{B} \times \mathcal{G}$, $E_\beta[m(y - g)|x]$ is continuously differentiable with bounded derivative; iv) $\partial E_\beta[m(y-g)|x]/\partial g\,|_{g=G(v(x,\beta),\beta)} > 0$ and is bounded away from zero, $G(v(x,\beta),\beta)$ solves $E_\beta[m(y-g)|x] = 0$, and $\partial E_\beta[m(\varepsilon)|x]/\partial\beta = E[m(\varepsilon)S_\beta|x]$.

Footnote 7: The moment restriction $E[m(\varepsilon)|x] = 0$ implies that $\partial E[\omega(x)m(\varepsilon)|v(x,\beta)]/\partial\beta = \partial E[\omega(x)E[m(\varepsilon)|x]\,|\,v(x,\beta)]/\partial\beta = 0$. Then differentiation of the first order conditions $E[\omega(x)m(y - G(v(x,\beta),\beta))\,|\,v(x,\beta)] = 0$ with respect to $\beta$ gives $\partial G(v,\beta)/\partial\beta = -E[\sigma^{-2}(x)v_\beta|v]/E[\sigma^{-2}(x)|v]$, so that differentiation separately with respect to the two $\beta$ arguments in $G(v(x,\beta),\beta)$ gives equation (4.6).

This Assumption consists of more or less standard regularity conditions. The next hypothesis imposes additional smoothness conditions.

Assumption 4.2: For any function $\zeta(y,x,\beta)$ that is bounded, continuously differentiable in $y$ and $\beta$, and has bounded derivative, the integral $E[m(y-g)\zeta(y,x,\beta)|x,\beta]$ is continuously differentiable in $\beta$ and $g$ with bounded derivative. Also, there exists a bounded, continuously differentiable function $\tilde m(\varepsilon)$ with bounded derivatives such that $E_\beta[m(\varepsilon(\beta))\tilde m(\varepsilon(\beta))|x]$ is bounded away from zero uniformly in $x$, $\beta$.

The first hypothesis is similar to the last part of Assumption 3.2, so that primitive conditions for it can be specified as in Section 3. Also, it is straightforward to formulate more primitive conditions for the second hypothesis with particular $m(\varepsilon)$. Because $E_\beta[m(\varepsilon(\beta))^2|x]$ is bounded and bounded away from zero, it suffices to find $\tilde m(\varepsilon)$ such that $E_\beta[\{m(\varepsilon(\beta)) - \tilde m(\varepsilon(\beta))\}^2|x]$ is small uniformly in $x$ and $\beta$. For
example,  in  the  conditional  mean  case,  with     m(e)  =  e,     if     EgNe|         |x]     is  bounded  on 
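$\mathcal{X} \times \mathcal{B}$, then choosing $\tilde m(\varepsilon)$ so that $|\tilde m(\varepsilon)| \le |\varepsilon|$ and $\tilde m(\varepsilon) = \varepsilon$ except when $|\varepsilon|$ is big enough will satisfy this assumption. Also, in the conditional median case with $m(\varepsilon) = \mathrm{sgn}(\varepsilon)$, with $\varepsilon$ continuously distributed given $x$ with bounded conditional density $f(\varepsilon|x)$, choosing $|\tilde m(\varepsilon)| \le 1$ and $\tilde m(\varepsilon) = \mathrm{sgn}(\varepsilon)$ except in a small enough neighborhood of zero will satisfy this assumption.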




Theorem 4.1: If Assumptions 4.1 to 4.3 are satisfied and $E[SS^T]$ is nonsingular for $S$ from equation (4.3), then for the class of parametric submodels such that $G(v,\eta)$ solves $E_\eta[m(y-g)|x] = 0$, $E_\eta[m(y - g)|x]$ is continuously differentiable in a neighborhood of $(G(v,\eta_0),\eta_0)$, and $\partial E_\eta[m(\varepsilon)|x]/\partial\eta = E[m(\varepsilon)S_\eta|x]$, it follows that $V = (E[SS^T])^{-1}$ is the supremum of Cramér-Rao bounds and any estimator $\hat\beta$ of $\beta_0$ that is regular satisfies $\sqrt{n}(\hat\beta - \beta_0) \xrightarrow{d} Z + U$, where $Z$ is distributed as $N(0,V)$ and $U$ is independent of $Z$.


Of particular interest for studying weighted average derivatives is the case where $v(x,\beta) = (x_1^T\beta, x_2^T)^T$, corresponding to a partial index model. Here $\beta$ is only identified up to scale, so we normalize the first coefficient to be one, with $v(x,\beta) = (x_{11} + x_{12}^T\beta, x_2^T)^T$. In this case, for $G_1(v) = \partial G(s,x_2)/\partial s\,|_{s = x_{11} + x_{12}^T\beta_0}$, the efficient score is

(4.7)  $S = \sigma^{-2}(x)\,u\,G_1(v)\{x_{12} - E[\sigma^{-2}(x)x_{12}|v]/E[\sigma^{-2}(x)|v]\}$.


5.        Efficiency  of  Weighted  Average  Derivative  Estimators  of  Index  Coefficients 

As we discussed in Section 2, average derivatives estimate the parameters of partial index models up to scale. In this Section we consider the efficiency of average derivative estimators of index model coefficients. We adopt the same normalization as before, where the first coefficient equals one. For $g(x) = G(x_{11} + x_{12}^T\beta, x_2)$, let $\delta_1 = E[w(x)\partial g(x)/\partial x_{11}]$ and $\delta_2 = E[w(x)\partial g(x)/\partial x_{12}]$. Assuming that $w(x)$ obeys the identification condition $\delta_1 \ne 0$, it will be the case that $\delta_2/\delta_1 = \beta$, so that $\hat\beta = \hat\delta_2/\hat\delta_1$ will be consistent for $\beta$. The asymptotic efficiency of $\hat\beta$ is analyzed in this Section.

To evaluate the efficiency of this estimator we will assume that $\hat\delta$ satisfies equation (3.3), having influence function given in equation (3.5). This assumption is justified because the influence function of any (regular) average derivative estimator satisfies equation (3.5), by Theorem 3.1. We also assume that equation (4.2) is satisfied, a standard regularity condition that is known to be a consequence of regularity of the estimator (e.g. see Newey, 1990a).

5.1      The  Asymptotic  Variance  of  Relative  Average  Derivative  Estimators 

Given that $\hat\delta$ has influence function in equation (3.5), the influence function of $\hat\beta$ (and hence its asymptotic variance) can be derived by the delta method. Let $s = x_{11} + x_{12}^T\beta$ and note that for any function $a(x) = a(x_{11},x_{12},x_2)$

(5.1)  $[-\beta,I]a'(x) = \partial a(s - x_{12}^T\beta, x_{12}, x_2)/\partial x_{12}$,

i.e. $[-\beta,I]a'(x)$ is the partial derivative of $a(x)$ with respect to $x_{12}$, holding $v$ constant. Recalling that $x_2$ is included in $v$, $[-\beta,I]g'(x) = 0$ since $g(x)$ depends only on $v$. Also, $\partial(\delta_2/\delta_1)/\partial\delta = \delta_1^{-1}[-\beta,I]$, so that by the delta method the influence function of $\hat\beta$ will equal

(5.2)  $\psi(z) = \delta_1^{-1}[-\beta,I]\{w(x)g'(x) - \delta - [w'(x) + w(x)f'(x)/f(x)]u\} = \xi(x)u$,  $\xi(x) = -\delta_1^{-1}[-\beta,I]\{w'(x) + w(x)f'(x)/f(x)\}$.


For notational convenience we use the same $\psi(z)$ and $\xi(x)$ notation here as in Sections 2 and 3, even though the expressions are not the same. The asymptotic variance of $\hat\beta$ is then $E[\psi(z)\psi(z)^T]$.
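
The delta-method step can be checked numerically. The following is a minimal sketch, not part of the original derivation: it compares the analytic Jacobian $\delta_1^{-1}[-\beta,I]$ of the ratio map with a finite-difference approximation. The function names and the numerical values of $\delta$ are hypothetical and chosen only for illustration.

```python
import numpy as np

def ratio(delta):
    """Map delta = (delta_1, delta_2')' to the ratio beta = delta_2 / delta_1."""
    delta = np.asarray(delta, dtype=float)
    return delta[1:] / delta[0]

def ratio_jacobian(delta):
    """Analytic Jacobian of the ratio map: delta_1^{-1} [-beta, I]."""
    delta = np.asarray(delta, dtype=float)
    beta = ratio(delta)
    k = beta.size
    return np.hstack([-beta.reshape(k, 1), np.eye(k)]) / delta[0]

def numerical_jacobian(delta, step=1e-6):
    """Central finite differences, used only as a check on the analytic form."""
    delta = np.asarray(delta, dtype=float)
    cols = []
    for j in range(delta.size):
        e = np.zeros_like(delta)
        e[j] = step
        cols.append((ratio(delta + e) - ratio(delta - e)) / (2.0 * step))
    return np.column_stack(cols)

# Hypothetical values of (delta_1, delta_2); delta_1 must be nonzero.
delta = np.array([2.0, 1.0, -3.0])
analytic = ratio_jacobian(delta)
numeric = numerical_jacobian(delta)
```

Multiplying the analytic Jacobian by $\delta$ itself gives zero, which is the same cancellation that removes the $w(x)g'(x) - \delta$ term from the influence function.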

It is interesting to note that the term $\mathrm{Var}(w(x)g'(x))$ in the average derivative bound has dropped out of the asymptotic variance of the index estimator. This result is consistent with the interpretation of this term as the one that accounts for the density of $x$ being unknown. Since $\beta$ is a feature of the conditional distribution of $y$ given $x$, the distribution of $x$ is ancillary for estimation of $\beta$, and knowledge of this distribution should not affect the efficiency of estimators for $\beta$.

5.2     Efficient  Weighting 

We next consider the efficiency of $\hat\beta$ as an estimator in the partial index model of Sections 2 and 4, where $\rho(\varepsilon)$ is given. In this model the efficiency of $\hat\beta$ depends on the weight function $w(x)$, and it would be useful to know whether a weight function can be chosen so that $\hat\beta$ is efficient, with asymptotic variance equal to the bound of Section 4. In this subsection we consider existence and the form of such an efficient weight.

Our first result is that an efficient weight function will exist when $x$ has an elliptically symmetric distribution and $\sigma^2(x)$ depends only on $v$, as described in the following result.

Theorem 5.1: Suppose that i) $x$ has an elliptically symmetric distribution, with nonsingular variance and density $\ell((x-\mu)^T\Lambda(x-\mu))$ for a differentiable $\ell(\cdot)$, constant vector $\mu$ and positive definite matrix $\Lambda$; ii) $\sigma^2(x) = E[u^2|v] = \sigma^2(v) > 0$ and $G_1(v)$ is nonzero with probability one. Then an efficient weight function is

(5.5)  $w(x) = \sigma^{-2}(v)G_1(v)/h((x-\mu)^T\Lambda(x-\mu))$,  $h(q) = \ell(q)/\int_q^\infty \ell(r)\,dr$.

The optimal weight depends on the density of the regressors through the inverse of the hazard $h(q)$. The density does not affect the weight if and only if $h(q)$ is constant, corresponding to exponential $\ell(q)$ and hence normally distributed $x$. In cases where $\ell(q)$ is "thicker tailed" than normal (e.g. $\ell(q)$ proportional to $1/(1+q^a)$, $a > 1$), $1/h(q)$ will tend to give more weight to larger values of $(x-\mu)^T\Lambda(x-\mu)$, while if $\ell(q)$ is "thinner tailed" than normal (e.g. $\ell(q)$ proportional to $\exp(-q^a)$, $a > 1$) it will tend to give less weight.
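
The behavior of the inverse hazard in these three cases can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the helper `inv_hazard` and its midpoint-rule integration are hypothetical, and the exponent $a = 2$ is an arbitrary choice satisfying $a > 1$. It evaluates $1/h(q) = \int_q^\infty \ell(r)\,dr/\ell(q)$ for a normal-type, a thicker tailed, and a thinner tailed $\ell$.

```python
import math

def inv_hazard(ell, q, n=20000):
    """Approximate 1/h(q) = (integral of ell over [q, inf)) / ell(q).

    The tail integral uses the substitution r = q + t/(1 - t), t in [0, 1),
    and a midpoint rule with n cells, so no truncation point is needed.
    """
    total = 0.0
    for i in range(n):
        t = (i + 0.5) / n
        r = q + t / (1.0 - t)
        total += ell(r) / (1.0 - t) ** 2
    return (total / n) / ell(q)

def ell_normal(q):
    # Normal regressors: density proportional to exp(-q/2).
    return math.exp(-q / 2.0)

def ell_thick(q, a=2.0):
    # Thicker tailed than normal: proportional to 1/(1 + q^a), a > 1.
    return 1.0 / (1.0 + q ** a)

def ell_thin(q, a=2.0):
    # Thinner tailed than normal: proportional to exp(-q^a), a > 1.
    return math.exp(-(q ** a))

for q in (0.0, 2.0, 5.0):
    print(q, inv_hazard(ell_normal, q), inv_hazard(ell_thick, q), inv_hazard(ell_thin, q))
```

In the normal case the printed inverse hazard stays at the constant 2, so the density drops out of the weight up to scale; in the other two cases it rises or falls with $q$, matching the qualitative description above.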

Some properties of elliptically symmetric distributions are essential to existence of an efficient weight, because of implicit constraints on $\xi(x)$. Let $\tilde x = (x_{12}^T, v^T)^T$ and $\tilde x_{-k}$ denote the vector of all elements of $\tilde x$ other than the $k$th. Then $\xi(x)$ is constrained in the following way.

Lemma 5.2: $E[\xi_k(x)|\tilde x_{-k}] = 0$, $k \le q - 1$.

This result can be interpreted in terms of the amount of information used by a single average derivative ratio. Each $\xi_k(x)$ is the term multiplying $u$ in the influence function of $\hat\beta_k$. Also, each $\hat\beta_k$ uses only the restriction that the index is of the form $x_{11} + \tilde x_k\beta_k$, and allows for $g(x)$ to depend on $\tilde x_{-k}$ in an arbitrary way. In other words, $\hat\beta_k$ is consistent for the coefficient in a partial index model with index $x_{11} + \tilde x_k\beta_k$. Lemma 5.2 is a consequence. It is exactly the condition that makes the influence function uncorrelated with nuisance parameter scores in such a partial index model, as required by consistency of $\hat\beta_k$, as in equation (4.2). In contrast, the elements of $\xi^*(x) = G_1(v)\sigma^{-2}(x)\{x_{12} - E[\sigma^{-2}(x)x_{12}|v]/E[\sigma^{-2}(x)|v]\}$ from the efficient score only have conditional expectation zero given $v$. This results from the index model imposing more restrictions than any individual average derivative ratio, leading to more information (variance) in the efficient score.

Theorem 5.1 shows that it is sometimes possible to combine the information from the individual derivative ratios to obtain efficiency. Intuitively, although the individual terms have less information than the efficient score, together they can have as much, because the index interpretation of the vector of derivative ratios comes from the index model. However, the requirement that a single average derivative vector contain all the information is quite restrictive, relying on linearity of certain conditional expectations as a necessary condition.

Footnote: The declining weight implicit in the density weighted estimator of Powell, Stock, and Stoker (1989) would tend to have high efficiency when the $x$ distribution has thinner tails than normal, although it is difficult to find an example where $1/h(q)$ behaves exactly like $\ell(q)$.

Theorem 5.3: Suppose that $\sigma^2(x) > 0$ depends only on $v$, $\mathrm{Prob}(u = 0) = 0$, $G_1(v)$ is nonzero with probability one, and $E[SS^T]$ is nonsingular. If an efficient weight function exists then for each $k \le q - 1$ there is a vector $c_k$ such that

$E[\tilde x_k|\tilde x_{-k}] = E[\tilde x_k|v] + c_k^T\{\tilde x_{-k} - E[\tilde x_{-k}|v]\}$.

Thus, linearity of $E[\tilde x_k|\tilde x_{-k}]$ in $\tilde x_{-k}$ for each $k$ is necessary for existence of an efficient weight, when $\sigma^2(x)$ depends only on $v$. We could also derive a result for the case where $\sigma^2(x)$ depends on $x$ in a more general way, but for simplicity we have not allowed for this generality here.

Although     x     having  nonlinear  conditional  expectations  will  rule  out  the  existence 
of  an  efficient  weight,   approximate  efficiency  can  still  be  achieved  by  combining 
influence  functions  from  many  derivative  estimators.     Intuitively,   combining  average 
derivatives  from  different  ratios  imposes  the  index  model  information,  that  can  lead  to 
efficiency  if  results  from  different  weighting  functions  are  used.     Unfortunately,   the 
efficient  combination  will  not  generally  exist  in  closed  form,  and  hence  is  difficult  to 
describe. 

We  use  certain  Hilbert  space  results  to  argue  that  average  derivatives  can  be 
combined  to  achieve  approximate  efficiency,  but  not  in  closed  form.     The  basic  result 
that  is  useful  here  is  that  the  closure  of  the  direct  sum  of  closed  linear  subspaces  is 
equal  to  the  orthogonal  complement  of  the  intersection  of  orthogonal  complements  of  the 
subspaces.     Take  the  Hilbert  space  to  be  the  usual  one  of  functions  of     x     with  finite 
mean  square.     Take  the  subspaces  to  be  the  set  of  functions  of     x  satisfying  Lemma  5.2 
for each $k$. These have orthogonal complement equal to the set of functions of $\tilde x_{-k}$. The intersection (over $k \le q-1$) of these sets is the set of functions of $v$. This intersection has orthogonal complement equal to the set of functions that have conditional mean zero given $v$. Then, it follows by $E[\xi^*(x)|v] = 0$ that each element of $\xi^*(x)$ is in the closure of $\{\xi_1(x) + \cdots + \xi_{q-1}(x) : E[\xi_k(x)|\tilde x_{-k}] = 0\}$. Thus, the terms that depend on $x$ in the efficient score can be approximated by the terms in the average derivative that depend on $x$, which will lead to approximate efficiency. However, it is impossible to give an explicit form for this additive decomposition, because a direct sum of subspaces need not be closed. Also, even when it is closed the decomposition into the sum does not generally have an explicit form, except when the subspaces are orthogonal or finite dimensional. Here the subspaces $\{\xi_k(x) : E[\xi_k(x)|\tilde x_{-k}] = 0\}$ are never orthogonal, and many finite dimensional cases are covered by Theorem 5.1. Thus, in general an explicit form for the optimal combination of different weighted average derivative estimators cannot be found.

5.3      Pooling  Weighted  Average  Derivatives  Via  Minimum  Chi-Square 

Combining different weighted average derivative estimators provides a way to achieve approximate efficiency. Also, when an efficient weight exists, it may have components that are unknown, and pooling estimators with known weights is an approach to feasible efficient estimation. An optimal way to pool different estimators so as to improve efficiency is by minimum chi-square estimation, which can be described as follows. Let $J \ge 2$ linearly independent weighting functions $w_j(x)$, $(j = 1,\ldots,J)$, be specified, let $\hat\delta_j$ denote the weighted average derivative estimator using $w_j(x)$, and let $\hat\beta_j = \hat\delta_{j2}/\hat\delta_{j1}$ be the associated ratio estimator. Stack the separate estimators into a vector $\hat\gamma = (\hat\beta_1^T,\ldots,\hat\beta_J^T)^T$ and let $H = [I,\ldots,I]^T = e_J \otimes I$, where $e_J$ is the $J$ vector of ones and $I$ is the identity matrix with dimension equal to the number of elements of $\beta$. Let $\psi_j(z)$ denote the influence function of $\hat\beta_j$ (satisfying equation (5.2) for $w(x) = w_j(x)$) and let $\Psi(z) = (\psi_1(z)^T,\ldots,\psi_J(z)^T)^T$. The asymptotic variance of $\hat\gamma$ is then $\Omega = E[\Psi(z)\Psi(z)^T]$. Let $\hat\Omega$ be a consistent estimator of $\Omega$, such as $\hat\Omega = \sum_{i=1}^n \hat\Psi(z_i)\hat\Psi(z_i)^T/n$ for $\hat\Psi(z) = (\hat\psi_1(z)^T,\ldots,\hat\psi_J(z)^T)^T$, with $\hat\psi_j(z)$ obtained by replacing all the unknown components in equation (5.2) by nonparametric estimators. The pooled estimator is

(5.6)  $\tilde\beta = \arg\min_\beta (\hat\gamma - H\beta)^T\hat\Omega^{-1}(\hat\gamma - H\beta) = (H^T\hat\Omega^{-1}H)^{-1}H^T\hat\Omega^{-1}\hat\gamma$.

It follows from standard minimum chi-square theory that $\tilde\beta$ is $\sqrt{n}$-consistent and asymptotically normal with asymptotic variance $(H^T\Omega^{-1}H)^{-1}$ that is no larger than the asymptotic variance of any linear combination of the $\hat\beta_j$, $(j = 1,\ldots,J)$. Also, a consistent estimator of the asymptotic variance will be $(H^T\hat\Omega^{-1}H)^{-1}$.
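
The closed form in equation (5.6) is straightforward to compute once the individual ratio estimators and an estimate of $\Omega$ are in hand. The following is a minimal sketch under that assumption; the function name `pool_min_chi_square` and the numerical inputs are hypothetical, and the nonparametric construction of $\hat\Omega$ is not shown.

```python
import numpy as np

def pool_min_chi_square(beta_hats, omega_hat):
    """Pool J ratio estimators by minimum chi-square, as in equation (5.6).

    beta_hats : array of shape (J, p), row j holding the j-th ratio estimator.
    omega_hat : array of shape (J*p, J*p), estimated asymptotic variance of
                the stacked vector gamma_hat.
    Returns the pooled estimator and the estimate (H' Omega^{-1} H)^{-1}
    of its asymptotic variance.
    """
    beta_hats = np.asarray(beta_hats, dtype=float)
    omega_hat = np.asarray(omega_hat, dtype=float)
    J, p = beta_hats.shape
    gamma_hat = beta_hats.reshape(J * p)             # stack beta_1, ..., beta_J
    H = np.kron(np.ones((J, 1)), np.eye(p))          # H = e_J (x) I
    omega_inv_H = np.linalg.solve(omega_hat, H)      # Omega^{-1} H
    precision = H.T @ omega_inv_H                    # H' Omega^{-1} H
    # omega_inv_H.T equals H' Omega^{-1} because omega_hat is symmetric.
    beta_pooled = np.linalg.solve(precision, omega_inv_H.T @ gamma_hat)
    return beta_pooled, np.linalg.inv(precision)

# Hypothetical example: two scalar estimators with uncorrelated errors.
beta_pooled, avar = pool_min_chi_square([[1.0], [3.0]], np.diag([1.0, 3.0]))
```

With a diagonal $\hat\Omega$ the pooled estimator reduces to the familiar inverse-variance weighted average, here $(1/1 + 3/3)/(1 + 1/3) = 1.5$.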

The minimum chi-square estimator will be efficient if the efficient score is a linear combination of the influence functions. Also, it will be approximately efficient for large $J$ if a linear combination of the influence functions can approximate the efficient score in mean-square. We state these results in the following theorem, also giving an interpretation of the difference of the efficient information matrix and the minimum chi-square precision matrix.

Theorem 5.4: $V^{-1} - H^T\Omega^{-1}H = \min_\Pi E[\{S - \Pi\Psi(z)\}\{S - \Pi\Psi(z)\}^T]$, so that if there exists $\Pi$ such that $S = \Pi\Psi(z)$ then $\tilde\beta$ is efficient. Furthermore, if for each $J$ there exists $\Pi_J$ such that $\lim_{J\to\infty} E[\|S - \Pi_J\Psi(z)\|^2] = 0$ then $\lim_{J\to\infty}(H^T\Omega^{-1}H)^{-1} = V$.
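
The first identity in Theorem 5.4 rests on the expansion $E[\{S - \Pi\Psi(z)\}\{S - \Pi\Psi(z)\}^T] = E[SS^T] - \Pi H - H^T\Pi^T + \Pi\Omega\Pi^T$, which uses $E[\Psi(z)S^T] = H$ from equation (4.2). The sketch below is an illustrative numerical check, not a proof: $\Omega$ and $E[SS^T]$ are arbitrary hypothetical matrices, and the code confirms that $\Pi^* = H^T\Omega^{-1}$ attains the value $E[SS^T] - H^T\Omega^{-1}H$ while any other $\Pi$ adds a positive semidefinite term.

```python
import numpy as np

def mse_matrix(Pi, info, H, Omega):
    """E[(S - Pi Psi)(S - Pi Psi)'] written in terms of E[SS'], H and Omega."""
    return info - Pi @ H - H.T @ Pi.T + Pi @ Omega @ Pi.T

rng = np.random.default_rng(0)
J, p = 3, 2
H = np.kron(np.ones((J, 1)), np.eye(p))              # H = e_J (x) I

# Hypothetical positive definite Omega and an information matrix E[SS']
# chosen to exceed the minimum chi-square precision H' Omega^{-1} H.
B = rng.normal(size=(J * p, J * p))
Omega = B @ B.T + J * p * np.eye(J * p)
precision = H.T @ np.linalg.solve(Omega, H)
info = precision + np.eye(p)

Pi_star = np.linalg.solve(Omega, H).T                # H' Omega^{-1}
gap_at_optimum = mse_matrix(Pi_star, info, H, Omega)

# Any other Pi increases the mean-square error matrix in the PSD ordering.
Pi_other = Pi_star + rng.normal(size=Pi_star.shape)
excess = mse_matrix(Pi_other, info, H, Omega) - gap_at_optimum
excess = 0.5 * (excess + excess.T)                   # symmetrize rounding error
```

The excess equals $(\Pi - \Pi^*)\Omega(\Pi - \Pi^*)^T$, which is why the minimum is understood in the positive semidefinite sense.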

The  second  condition,   on  existence  of  a  mean-square  approximation  of  the  scores  by  the 
influence  function,  is  a  "spanning  condition"  for  efficiency.     It  is  the  (minimum 
chi-square)  analog  of  the  generalized  method  of  moments  spanning  condition  given  in  Newey 
(1990b). 

To achieve approximate efficiency by combining weighted average derivative estimators, the weights will have to satisfy a spanning condition corresponding to that of Theorem 5.4. This spanning condition is quite complicated, because the influence functions depend on both the weight and its derivative. For this reason, we do not try to give as general a spanning condition as possible, instead focusing on conditions that are relatively easy to verify. The first of these conditions involves the data distribution. Let $f(x)$ denote the density of $x$ and let $\tilde X_k$ denote the random variable with realization $\tilde x_k$.

Assumption 5.1: $x$ has compact support, $\sigma^2(x)$ is bounded and bounded away from zero, $E[\|f'(x)/f(x)\|^2] < \infty$, and for $k \le q-1$ and any positive integer $r$, $\partial f(\tilde x_k|\tilde x_{-k})/\partial\tilde x_k$ and $f(\tilde x_k|\tilde x_{-k})^{-2}\,\partial f(\tilde x_k|\tilde x_{-k})/\partial\tilde x_k\,\mathrm{Cov}(1(\tilde X_k \le \tilde x_k),\tilde X_k^r|\tilde x_{-k})$ are continuous on the support of $x$.


The last part of this condition does not seem primitive, but it is straightforward to check for particular distributions. In particular, as long as $f(\tilde x_k|\tilde x_{-k}) > 0$ on the interior of the support, the last expression will be continuous on the interior, so that it suffices to show continuity on the boundary. For example, suppose $f(\tilde x_k|\tilde x_{-k})$ is a beta density, proportional to $\tilde x_k^a(1-\tilde x_k)^b$ for coefficients $a$, $b$ that are continuous in $\tilde x_{-k}$, bounded, positive, and bounded away from one, where dependence of $a$ and $b$ on $\tilde x_{-k}$ is suppressed for notational convenience. Then

$f(\tilde x_k|\tilde x_{-k})^{-2}\,\partial f(\tilde x_k|\tilde x_{-k})/\partial\tilde x_k\,\mathrm{Cov}(1(\tilde X_k \le \tilde x_k),\tilde X_k^r|\tilde x_{-k}) = [a - (a+b)\tilde x_k]\int_0^{\tilde x_k}(t^r - [a/(a+b)]^r)t^a(1-t)^b\,dt\,/\,[\tilde x_k^{a+1}(1-\tilde x_k)^{b+1}]$.

This function will be continuous at any $\tilde x$ where $0 < \tilde x_k < 1$ (by dominated convergence). Also, if a sequence is such that $\tilde x_k$ converges to zero or one, then continuity follows by L'Hôpital's rule, with the limit of this expression equal to $-a[a/(a+b)]^r/(a+1)$ at $0$ and $-b\{1-[a/(a+b)]^r\}/(1-b)$ at $1$ for $a$ and $b$ evaluated at the limiting value of $\tilde x_{-k}$.

When Assumption 5.1 is satisfied there are simple conditions on the weight functions for the spanning condition to be satisfied. Let $\tilde x_k(t,\tilde x)$ denote the vector with $t$ in the $k$th position and other components equal to the corresponding components of $\tilde x_{-k}$.

Assumption 5.2: $w_j(x) = w_0(v)p_{jJ}(\tilde x)$ with $1/C < w_0(v) < C$ for some $C > 0$. Also, for each $k \le q - 1$, there is a set $\mathcal{B} \subseteq \mathbb{R}$, $\inf(\mathcal{B}) = -\infty$, such that for any $\epsilon > 0$, continuous function $a(\tilde x)$, and $b \in \mathcal{B}$, there is $\bar J$ and $\pi_{1J},\ldots,\pi_{JJ}$ such that $\sup_{\tilde x}|\sum_{j=1}^J \pi_{jJ}\,\partial p_{jJ}(\tilde x)/\partial\tilde x_k - a(\tilde x)| < \epsilon$ and $\sum_{j=1}^J \pi_{jJ}p_{jJ}(\tilde x) = \sum_{j=1}^J \pi_{jJ}\int_b^{\tilde x_k}\partial p_{jJ}(\tilde x_k(t,\tilde x))/\partial\tilde x_k\,dt$ for $J > \bar J$.


This hypothesis says that the partial derivatives with respect to each $\tilde x_k$ can approximate any continuous function, and that the linear combination $\sum_j \pi_{jJ}p_{jJ}(\tilde x)$ is a definite integral of the partial derivative. It is easy to check that this hypothesis is satisfied for particular choices of $p_{jJ}(\tilde x)$. For example, suppose $p_{jJ}(\tilde x)$ is a power series with all terms of a given integer order (sum of exponents) and below included. Then the hypotheses follow by the Weierstrass theorem and because derivatives and definite integrals of power series are also power series. Also, for similar reasons, this hypothesis will be satisfied for Gallant's (1981) flexible Fourier form.

An important assumption here is that $w_0(v)$ is a function of $v$ and $p_{jJ}(\tilde x)$ is a function of $\tilde x$, both of which depend on the unknown index $x_{11} + x_{12}^T\beta$, so that these weights are not feasible. They can be made feasible by replacing $\beta$ with an estimator. Because the choice of weights does not affect the consistency of the estimators, under appropriate regularity conditions the replacement of $\tilde x$ with its corresponding estimated value will have no effect on the limiting distribution of the minimum chi-square estimator, and hence no effect on its efficiency.

The next result shows Assumptions 5.1 and 5.2 are sufficient for the spanning condition for near efficiency of minimum chi-square.

Theorem 5.5: If Assumptions 5.1 and 5.2 are satisfied, then $\lim_{J\to\infty}(H^T\Omega^{-1}H)^{-1} = V$.
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5.4     The  Choice  of  Weights  in  Practice 

We  can  combine  the  near  efficiency  of  minimum  chi-square  with  the  form  of  the 
optimal  weights  given  for  the  elliptically  symmetric  case  to  suggest  a  sequence  of 
weights  that  should  have  high  efficiency  with  just  a  few  terms  included.     The  basic  idea 
is  to  initialize  the  weight  sequence  at  values  that  are  close  to  efficient  when  the 
regressors  are  normal,  add  first  some  terms  to  allow  for  nonnormal  but  elliptically 
symmetric     x's,     and  then  include  terms  that  would  account  for  non  elliptical 
distributions. Let $G_v(v)$ and $\sigma^2(v)$ be obtained from a preliminary parametric
estimator of the index model. For example, if $y$ is binomial, $g(x)$ is the
conditional mean, and $v = x_{11} + x_{12}^T\beta$, then corresponding to a preliminary probit
estimator one might choose $G_v(v) = \phi(a + b\cdot v)$ and $\sigma^2(v) = \Phi(a + b\cdot v)[1-\Phi(a + b\cdot v)]$,
where $a$ is the constant and $b$ the coefficient of $x_{11}$ from probit. Let $\hat\mu$ and $\hat\Sigma$
be the sample mean and variance of the regressors and $m_j(\cdot)$ be functions of a
scalar argument, with $m_1(\cdot) = 1$, and other $m_j$ allowing for elliptically symmetric
distributions other than normal, such as $m_j(q) = [q/(1+q)]^{j-1}$. Then a weight sequence
that is initialized at normal and elliptically symmetric regressors is as given in
Assumption 5.2 for

$w_0(v) = G_v(v)/\sigma^2(v)$,   $p_j(x) = m_j((x-\hat\mu)'\hat\Sigma^{-1}(x-\hat\mu))$,   $j \le J$,

where $J$ is some (small) positive integer. Higher (than $J$) order terms might include
powers of $x$. As noted above, estimation of the weights will not affect the asymptotic
distribution of relative average derivative estimators.
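
As a purely illustrative sketch of how such a weight sequence might be evaluated, the Python fragment below computes the probit-based leading weight and a few elliptical correction terms at a single point. Several things here are assumptions rather than statements from the text: the leading weight is read as $G_v(v)/\sigma^2(v)$ with $G_v = \phi$ and $\sigma^2 = \Phi(1-\Phi)$, the $j$-th weight is taken to be the product $w_0(v)p_j(x)$, the example $m_j(q) = [q/(1+q)]^{j-1}$ is used, and all function names and inputs are hypothetical.

```python
import math


def norm_pdf(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def norm_cdf(t):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def probit_weight(v, a, b):
    """Assumed leading weight phi(a+bv) / {Phi(a+bv)[1-Phi(a+bv)]}."""
    t = a + b * v
    p = norm_cdf(t)
    return norm_pdf(t) / (p * (1.0 - p))


def mahalanobis_sq(x, mu, sigma_inv):
    """Quadratic form (x-mu)' Sigma^{-1} (x-mu) for plain Python lists."""
    d = [xi - mi for xi, mi in zip(x, mu)]
    k = len(d)
    return sum(d[r] * sigma_inv[r][c] * d[c] for r in range(k) for c in range(k))


def m_term(j, q):
    """Assumed example m_j(q) = [q/(1+q)]^(j-1), so that m_1 = 1."""
    return (q / (1.0 + q)) ** (j - 1)


def weight_terms(x, v, a, b, mu, sigma_inv, J):
    """Weights w_0(v) * m_j(q(x)) for j = 1..J (product form is an assumption)."""
    w0 = probit_weight(v, a, b)
    q = mahalanobis_sq(x, mu, sigma_inv)
    return [w0 * m_term(j, q) for j in range(1, J + 1)]
```

In practice `a`, `b`, `mu`, and `sigma_inv` would come from the preliminary probit fit and the sample moments of the regressors; the first term alone corresponds to the weight that is close to efficient under normal regressors.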

An important practical issue is the choice of the number of weights in applications.
One way the number of weights might be chosen is by the minimum chi-square analog of the
GMM cross-validation criterion suggested in Newey (1990b). The basic idea is to use
equation (4.2) to construct a cross-validated estimator of the mean-square of the
difference between the efficient score and the influence function, which by Theorem 5.4
is related to the difference of the efficient information and the minimum chi-square
precision. By equation (4.2),

$E[\{S-\Pi\psi(z)\}\{S-\Pi\psi(z)\}^T] = E[SS^T] - \Pi H - H^T\Pi^T + \Pi E[\psi(z)\psi(z)^T]\Pi^T$.

The first term does not depend on $J$, so in comparing mean-squared
error for different $J$ this term can be dropped. Let $\hat\psi(z)$ be an estimator, as
described above for the minimum chi-square estimator, $\hat\Omega_{-i} = \sum_{j\ne i}\hat\psi(z_j)\hat\psi(z_j)^T/(n-1)$, and

$\hat\Pi_{-i} = H^T(\hat\Omega_{-i})^{-1} = H^T[\hat\Omega^{-1} + \hat\Omega^{-1}\hat\psi(z_i)(1-\hat\psi(z_i)^T\hat\Omega^{-1}\hat\psi(z_i))^{-1}\hat\psi(z_i)^T\hat\Omega^{-1}]$.

Let $\Gamma$ be a positive definite matrix. Then a cross-validated criterion for choosing $J$ is

$CV(J) = \mathrm{trace}\{\Gamma\sum_{i=1}^n[-\hat\Pi_{-i}H - H^T\hat\Pi_{-i}^T + \hat\Pi_{-i}\hat\psi(z_i)\hat\psi(z_i)^T\hat\Pi_{-i}^T]\}$.

This is an estimator of the scalar mean-square error criterion $E[\{S-\Pi\psi(z)\}^T\Gamma\{S-\Pi\psi(z)\}]$
up to an additive constant.$^9$
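
A minimal Python sketch of a leave-one-out criterion of this kind is given below for the special case of a scalar parameter, so that $H$ is a vector of ones, $\Gamma = 1$, and each observation contributes $-2\hat\Pi_{-i}H + (\hat\Pi_{-i}\hat\psi(z_i))^2$. These simplifications, the function names, and the input layout (one row of $J$ estimated influence-function values per observation) are assumptions made for illustration. Each $\hat\Omega_{-i}$ is recomputed directly rather than through the rank-one updating formula, which is slower but easier to check.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def cv_criterion(psi):
    """Leave-one-out criterion for a scalar parameter.

    psi is a list of n rows; row i holds the J estimated influence-function
    values for observation i. H is taken to be a J-vector of ones.
    """
    n, J = len(psi), len(psi[0])
    h = [1.0] * J
    total = 0.0
    for i in range(n):
        # Second-moment matrix of psi with observation i left out.
        omega = [[sum(psi[j][r] * psi[j][c] for j in range(n) if j != i) / (n - 1)
                  for c in range(J)] for r in range(J)]
        coef = solve(omega, h)  # Omega_{-i}^{-1} H, the transpose of Pi_{-i}
        pi_h = sum(ck * hk for ck, hk in zip(coef, h))
        pi_psi = sum(ck * pk for ck, pk in zip(coef, psi[i]))
        total += -2.0 * pi_h + pi_psi ** 2
    return total
```

One would evaluate `cv_criterion` on the influence-function estimates for each candidate $J$ and choose the $J$ with the smallest value.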

5.5      Efficiency  For  Conditional  Distribution  Index  Models 

When the conditional index distribution model of equation (2.4) holds, relative
average derivatives will estimate the same coefficients for different $\rho(\varepsilon)$ functions,
so that their asymptotic efficiencies can be compared. The asymptotic variance is
$E[\sigma^2(x)\ell(x)\ell(x)^T]$ for $\sigma^2(x) = \nu(x)^{-2}E[m(\varepsilon)^2|x]$. As discussed in Section 3, $\sigma^2(x)$ is
the asymptotic variance for estimation of the location parameter
$\mu(x) = \mathrm{argmin}_{\mu} E[\rho(y-\mu)|x]$, so that the asymptotic variance of relative average derivative
estimators depends on $\rho(\varepsilon)$ in a way that is analogous to the location parameter. Here
this dependence is even more direct, since the first term in the average derivative has
dropped out. For example, if the conditional distribution of $y$ given $x$ is
"fat-tailed" enough at each $x$ so that the median has lower asymptotic variance than the
mean, then the asymptotic variance of a weighted average derivative estimator based on
the median will be smaller than that based on the mean.
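
The mean-versus-median comparison can be made concrete with a small Python calculation of $\sigma^2 = \nu^{-2}E[m(\varepsilon)^2]$ for two error distributions. The sketch assumes the standard location-model forms: for the mean, $m(\varepsilon) = \varepsilon$ and $\nu = -1$; for the median, $m(\varepsilon) = \mathrm{sign}(\varepsilon)$ and $\nu = -2f(0)$, where $f(0)$ is the error density at its median. The function names and the two example distributions are illustrative choices, not taken from the text.

```python
import math


def avar_location(second_moment_m, nu):
    """sigma^2 = E[m(eps)^2] / nu^2 for a location functional."""
    return second_moment_m / nu ** 2


def avar_mean(var_eps):
    """Mean: m(eps) = eps, nu = -1, so sigma^2 is the error variance."""
    return avar_location(var_eps, -1.0)


def avar_median(density_at_median):
    """Median: m(eps) = sign(eps), E[m^2] = 1, nu = -2 f(0)."""
    return avar_location(1.0, -2.0 * density_at_median)


# Standard normal errors: variance 1, f(0) = 1/sqrt(2*pi).
normal_mean = avar_mean(1.0)
normal_median = avar_median(1.0 / math.sqrt(2.0 * math.pi))

# Laplace errors with unit scale: variance 2, f(0) = 1/2 (fatter tails).
laplace_mean = avar_mean(2.0)
laplace_median = avar_median(0.5)
```

Under normal errors the mean has the smaller variance (1 against $\pi/2$), while under the fatter-tailed Laplace errors the ranking reverses (2 against 1), which is the situation described above.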

$^9$ In a Monte Carlo example for a different semiparametric estimation problem, given in
Newey (1990b), an analogous GMM criterion led to an estimator with dispersion that was almost
as small as that of the best $J$.

It is possible to approximately attain the semiparametric bound for the conditional
index model, which is given in Newey (1990b), by combining relative average derivative
estimators for different weights and $\rho(\varepsilon)$ functions. A sufficient spanning condition
is that the weighted average derivative estimators are calculated from all combinations
of a sequence of weights satisfying Assumptions 5.1 and 5.2 and a sequence of $m(\varepsilon)$
functions for which linear combinations can, for the conditional distribution of $\varepsilon$
given any $x$, approximate any function with finite mean square. For brevity, we have
omitted a full description and the proof of this result.

Appendix:  Proofs of Theorems

Throughout  the  appendix  let     C     denote  a  generic  matrix  of  positive  constants  that  may  be 
different  in  different  appearances. 

Proof of Theorem 3.1: Let $\tilde\zeta(y,x)$ and $\tilde\gamma(x)$ be bounded and continuously
differentiable, let $\zeta(y,x) = \tilde\zeta(y,x) - E[\tilde\zeta|x]$, $\gamma(x) = \tilde\gamma(x) - E[\tilde\gamma(x)]$, and for the density
$f(z)$ of $z$ consider the parametric submodel

(A.1)  $f(z|\theta) = f(z)[1 + \theta^T\zeta(y,x)][1 + \theta^T\gamma(x)]$.

Both $\zeta(y,x)$ and $\gamma(x)$ are bounded, so this is a density function for $\theta$ close enough
to $\theta_0 = 0$. Also,

(A.2)  $E_\theta[a|x] = E[a|x] + \theta^TE[a\zeta|x]$.

In addition, mean-square continuous differentiability of $f(z|\theta)^{1/2}$ in a neighborhood $\Theta$
of $\theta = 0$ follows by Lemma C.4 of Newey (1991b), with score

(A.3)  $S_\theta(z) = \zeta(y,x) + \gamma(x)$.

By Assumption 3.2 and Lemma C.2 of Newey (1991b), $E[\tilde\zeta(y,x)|x]$ is continuously
differentiable in $x$ with bounded derivatives, and hence so is $\zeta(y,x)$, so that
$E[m(y-g)|x]$ and $E[m(y-g)\zeta(y,x)|x]$ are continuously differentiable in $(x,g)$ on $\mathcal{X}\times\mathcal{G}$,
and hence by eq. (A.2), $E_\theta[m(y-g)|x]$ is continuously differentiable on $\mathcal{X}\times\mathcal{G}\times\Theta$. In
particular, by continuity $\Theta$ can be chosen small enough that for all $\theta \in \Theta$ there
exists a unique solution $g(x,\theta)$ to $E_\theta[m(y-g)|x] = 0$ for $g \in \mathcal{G}$. By the implicit
function theorem, for each $(x, \theta)$ in $\mathcal{X}\times\Theta$ there is a neighborhood such that $\partial g(x,\theta)/\partial\theta$
and $g'(x,\theta)$ exist and are continuous on that neighborhood, so that by compactness of
$\mathcal{X}$, $\Theta$ can be chosen small enough that $\partial g(x,\theta)/\partial\theta$ and $g'(x,\theta)$ exist, are continuous
and bounded on $\mathcal{X}\times\Theta$, and

(A.4)  $\partial g(x,\theta)/\partial\theta|_{\theta=0} = -\nu(x)^{-1}E[m(\varepsilon)\zeta|x] = E[u\zeta|x]$.


Next, the marginal density of $x$ in the parametric submodel $f(z|\theta)$ is $f(x|\theta) =
f(x)[1+\theta^T\gamma(x)]$, so that by integration by parts,

(A.5)  $\delta(\theta) = E_\theta[w(x)g'(x,\theta)] = E[w(x)g'(x,\theta)] + \theta^TE[w(x)\gamma(x)g'(x,\theta)]$

       $= E[\ell(x)g(x,\theta)] - \theta^TE[f(x)^{-1}[w(x)\gamma(x)f(x)]'g(x,\theta)]$.

By $g(x,\theta)$ differentiable in $\theta$ with bounded derivative, $\delta(\theta)$ is differentiable in
$\theta$, and by eq. (A.4) and another integration by parts,

(A.6)  $\partial\delta(\theta)/\partial\theta|_{\theta=0} = E[\ell(x)\partial g(x,\theta)/\partial\theta^T|_{\theta=0}] - E[f(x)^{-1}[w(x)\gamma(x)f(x)]'g(x)]$

       $= E[\ell(x)u\zeta^T] + E[w(x)g'(x)\gamma(x)^T] = E[\psi(z)S_\theta(z)^T]$.

Thus, $\psi(z)$ is the pathwise derivative for all parametric submodels as specified above.
Next, it follows, e.g. by Lemma C.7 of Newey (1991b), that for any $s(z)$ with
finite mean-square and $E[s(z)] = 0$, and for any $\epsilon > 0$, there are $\tilde\zeta(y,x)$ and $\tilde\gamma(x)$
satisfying the above boundedness and smoothness hypotheses with $E[\|s(z)-\tilde\zeta(y,x)\|^2] < \epsilon$
and $E[\|E[s|x]-\tilde\gamma(x)\|^2] < \epsilon$, so that

(A.7)  $E[\|s(z)-S_\theta(z)\|^2] = E[\|s(z)-E[s|x] + E[s|x] - \zeta - \gamma\|^2]$

       $\le 2E[\|s(z)-E[s|x] - \zeta\|^2] + 2E[\|E[s|x] - E[s] - \gamma\|^2] \le 4\epsilon$,

where the last inequality follows by the Cauchy-Schwarz inequality. The first
conclusion then follows by Bickel, Klaassen, Ritov, and Wellner (1992, Chapter 3, Theorem
2).

To obtain the second conclusion, let $\phi(z)$ denote the influence function of the estimator and note that $E_\theta[\|\phi(z)\|^2]$ is continuous in $\theta$. Then
by regularity and Theorem 2.2 of Newey (1990a), $\partial\delta(\theta)/\partial\theta|_{\theta=0} = E[\phi(z)S_\theta(z)^T]$, so by eq.
(A.6), $E[(\phi(z)-\psi(z))S_\theta(z)^T] = 0$. The second conclusion then follows because $S_\theta(z)$
can approximate any mean zero vector function in mean square, and hence can approximate
$\phi(z)-\psi(z)$, so that $E[\|\phi(z)-\psi(z)\|^2] = 0$. QED.

We prove Theorem 4.1 using two Lemmas. The first gives the form of the tangent set and
the second the projection on the tangent set.

Lemma A.1: If Assumptions 4.1 are satisfied then the tangent set, for all parametric
submodels satisfying the hypotheses of Theorem 4.1, is

$\mathcal{T} = \{t(z) : E[t(z)^2] < \infty,\ E[t(z)] = 0,\ E[tu|x] = E[tu|v]\}$.

Proof: We prove this result by showing that any nuisance score must lie in $\mathcal{T}$ and
exhibiting a class of parametric submodels that can approximate anything in $\mathcal{T}$ in mean
square. Consider first any nuisance score for a submodel satisfying the hypotheses of
Theorem 4.1. By the implicit function theorem $G(v,\eta)$ is differentiable at $\eta_0$ and
$\partial G(v,\eta)/\partial\eta = -\nu(x)^{-1}\partial E_\eta[m(\varepsilon)|x]/\partial\eta = E[uS_\eta|x]$. Thus, $E[uS_\eta|x]$ is a function of $v$ and so
$S_\eta$ lies in $\mathcal{T}$.

To construct a regular parametric submodel with score that can approximate anything
in $\mathcal{T}$, let $\mathcal{D}$ denote the statement "the function is bounded, continuously
differentiable in $y$ and $\beta$, and has bounded derivatives." Let $\tilde\zeta(y,x)$ satisfy $\mathcal{D}$.
Then by Assumption 4.1 and Lemma C.2 of Newey (1991b), $E_\beta[\tilde\zeta(y,x)|x]$ is continuously
differentiable in $\beta$, with derivative $E_\beta[\tilde\zeta(y,x)S_\beta(y|x)|x]$, where $S_\beta(y|x)$ is the
(conditional) score for $f(y|x,\beta)$. This matrix is bounded by $\tilde\zeta$ bounded and boundedness
of the conditional information $E_\beta[S_\beta(y|x)S_\beta(y|x)^T|x]$. Thus, $\bar\zeta(y,x,\beta) = \tilde\zeta(y,x) -
E_\beta[\tilde\zeta(y,x)|x]$ satisfies $\mathcal{D}$. Also, let $\tilde m(\varepsilon)$ be as specified in Assumption 4.2. Then
$\bar m(\varepsilon(\beta)) = \tilde m(\varepsilon(\beta))-E_\beta[\tilde m(\varepsilon(\beta))|x]$ satisfies $\mathcal{D}$ by Assumption 4.1 and Lemma C.2 of Newey
(1991b), so that by Assumptions 4.1 and 4.2 $E_\beta[\bar m(\varepsilon(\beta))m(\varepsilon(\beta))|x]$ and
$E_\beta[\bar\zeta(y,x,\beta)m(\varepsilon(\beta))|x]$ satisfy $\mathcal{D}$. For a function $\alpha(v)$ that is continuously
differentiable with bounded derivative, let

$\zeta(y,x,\beta) = \bar\zeta(y,x,\beta) - \bar m(\varepsilon(\beta))(E_\beta[\bar m(\varepsilon(\beta))m(\varepsilon(\beta))|x])^{-1}\{E_\beta[\bar\zeta(y,x,\beta)m(\varepsilon(\beta))|x] - \alpha(v(x,\beta))\}$.

Then this function satisfies $\mathcal{D}$ and $E_\beta[\zeta(y,x,\beta)|x] = 0$ by construction. Therefore,
for any function $\gamma(x)$ with mean zero, $A(z,\theta) = (1 + \eta^T\zeta(y,x,\beta))(1 + \eta^T\gamma(x))$ is
continuously differentiable in $\theta = (\beta,\eta)$ with bounded derivative and is bounded away
from zero and one for $\eta$ in a small enough neighborhood of zero, and $E_\beta[A(z,\theta)] = 1$,
so that by Lemma C.4 of Newey (1991b), $f(z|\theta) = f(z|\beta)A(z,\theta)$ is a density with
mean-square continuously differentiable square root, with

(A.8)  $S_\eta = \bar\zeta(z) - \bar m(\varepsilon)(E[\bar m(\varepsilon)m(\varepsilon)|x])^{-1}\{E[\bar\zeta(z)m(\varepsilon)|x] - \alpha(v)\}$.

Thus, $f(z|\theta)$ is smooth. To show that it is a parametric submodel, note that

(A.9)  $E_\theta[m(\varepsilon)|x] = \eta^T\alpha(v(x,\beta))$.

By Assumption 4.1, $E_\theta[m(y-g)|x] = E_\beta[m(y-g)|x] + \eta^TE_\beta[m(y-g)\zeta(y,x,\beta)|x]$ is continuously
differentiable in $g$ and by Assumption 4.1, there is $\epsilon > 0$ such that for all $\eta$
small enough $\partial E_\theta[m(y-g)|x]/\partial g$ is bounded away from zero on

[G(v(x,p),/3)-e,G(v(x,0),/3)+€],     uniformly  in     x     and     0.     Therefore,  by  eq.   (A.9),  for 
all     tj     small  enough  there  is     i>(v(x,0),7))  in  a  neighborhood  of  zero  such  that 

(A.10)  0  =  EJm(e  +  Wv(x,0),7))  |x]  =  EQ[m(y  -  G(v(x,0),9))  |x], 

G(v(x,0),9)  =  G(v(x,0),0)  +  Wv(x,0),T)). 


This is a local minimum of $E_\theta[\rho(y-g)|x]$ by continuous differentiability of $E_\theta[m(y-g)|x]$
and the derivative bounded away from zero, and a global minimum for small enough $\eta$ by
the theorem of the maximum and compactness of $\mathcal{G}$. Thus, $f(z|\theta)$ satisfies the index
restriction, and hence is a smooth parametric submodel. Furthermore, the other
hypotheses in Theorem 4.1 for the parametric submodels are satisfied by construction.
To show that $S_\eta$ can approximate anything in $\mathcal{T}$, note that for $t \in \mathcal{T}$, by
boundedness of $E[m(\varepsilon)^2|x]$, $E[\|E[m(\varepsilon)t|v]\|^2] = E[\|E[m(\varepsilon)t|x]\|^2] \le E[E[m(\varepsilon)^2|x]E[\|t\|^2|x]]
\le CE[\|t\|^2]$. Then for any $\epsilon > 0$ it follows by Lemma C.7 of Newey (1991b) that there
exists $\tilde\zeta(y,x)$, $\alpha(v)$, and $\gamma(x)$ that are bounded and continuously differentiable with
bounded derivative such that $E[\|t-\tilde\zeta\|^2] < \epsilon$, $E[\|E[m(\varepsilon)t|v] - \alpha(v)\|^2] < \epsilon$, and
$E[\|E[t|x]-\gamma(x)\|^2] < \epsilon$. Also, $E[\|t-E[t|x]-\bar\zeta\|^2] \le CE[\|t-\tilde\zeta\|^2] + CE[\|E[t|x]-E[\tilde\zeta|x]\|^2] \le C\epsilon$.
Therefore,

(A.11)  $E[\|t-S_\eta\|^2] \le C\{E[\|t-E[t|x]-\bar\zeta\|^2] + E[\|E[t|x]-\gamma(x)\|^2]$

        $+ E[\mathrm{Var}(\tilde m(\varepsilon)|x)(E[\bar m(\varepsilon)m(\varepsilon)|x])^{-2}\|E[m(\varepsilon)\bar\zeta|x] - \alpha(v)\|^2]\}$

        $\le C\epsilon + CE[\|E[m(\varepsilon)\bar\zeta|x] - E[m(\varepsilon)t|x] + E[m(\varepsilon)t|v] - \alpha(v)\|^2]$

        $\le C\epsilon + CE[\|E[m(\varepsilon)(\bar\zeta-t)|x]\|^2] + CE[\|E[m(\varepsilon)t|v] - \alpha(v)\|^2]$

        $\le C\epsilon + CE[E[m(\varepsilon)^2|x]E[\|\bar\zeta-t\|^2|x]] \le C\epsilon$.  QED.

Lemma A.2: If $E[u^2] < \infty$ and $E[\sigma^{-2}(x)] < \infty$ then the projection of a $q \times 1$ random
vector $s$ with finite mean-square on $\mathcal{T}$ is

$s - E[s] - uR(x)$,   $R(x) = \sigma^{-2}(x)\{E[u\cdot s|x] - E[\sigma^{-2}(x)u\cdot s|v]/E[\sigma^{-2}(x)|v]\}$.

Proof: For notational simplicity let $q = 1$. By the Cauchy-Schwarz inequality,
$E[\{\sigma^{-2}(x)uE[u\cdot s|x]\}^2] \le E[E[\{u/\sigma^2(x)\}^2|x]E[u^2|x]E[s^2|x]] = E[s^2] < \infty$, and
$E[\{\sigma^{-2}(x)uE[\sigma^{-2}(x)u\cdot s|v]/E[\sigma^{-2}(x)|v]\}^2] \le E[\sigma^{-2}(x)E[\sigma^{-2}(x)^2u^2|v]E[s^2|v]/E[\sigma^{-2}(x)|v]^2]
= E[E[\sigma^{-2}(x)^2u^2|v]E[s^2|v]/E[\sigma^{-2}(x)|v]] = E[E[\sigma^{-2}(x)^2E[u^2|x]|v]E[s^2|v]/E[\sigma^{-2}(x)|v]] =
E[s^2]$, where all the expressions are finite by $E[\sigma^{-2}(x)] < \infty$ (implying $E[\sigma^{-2}(x)|v]$
exists and $\mathrm{Prob}(\sigma^2(x) = 0) = 0$). Thus, for the expression $t$ given in the statement of
the Lemma, $E[t^2] < \infty$. Also, note that $E[R(x)|v] = E[E[\sigma^{-2}(x)us|x]|v] -
E[\sigma^{-2}(x)|v]E[\sigma^{-2}(x)us|v]/E[\sigma^{-2}(x)|v] = 0$. Then $E[tu|x] = E[su|x] - E[u^2|x]R(x) =
E[\sigma^{-2}(x)u\cdot s|v]/E[\sigma^{-2}(x)|v]$ is a function of $v$, so that $t \in \mathcal{T}$. Also, for any $\tilde t \in \mathcal{T}$,
$E[(s-t)\tilde t] = E[E[\tilde tu|x]R(x)] = E[E[\tilde tu|v]R(x)] = E[E[\tilde tu|v]E[R(x)|v]] = 0$. QED.

Proof of Theorem 4.1: By Assumption 4.1 and the implicit function theorem, $f(z|\beta)$ is
smooth with score $S_\beta$ satisfying

(A.12)  $E[uS_\beta|x] = v_\beta + \partial G(v,\beta)/\partial\beta$.

This equation implies that

(A.13)  $E[\sigma^{-2}(x)uS_\beta|v] = E[\sigma^{-2}(x)E[uS_\beta|x]|v] = E[\sigma^{-2}(x)v_\beta|v] + \partial G(v,\beta)/\partial\beta\,E[\sigma^{-2}(x)|v]$.

By Lemma A.1 the tangent set is $\mathcal{T}$, so by Lemma A.2 and eq. (A.13) the residual from the
projection of $S_\beta$ on $\mathcal{T}$ is $S$. The conclusion now follows by Bickel, Klaassen, Ritov,
and Wellner (1992, Chapter 3, Theorem 2). QED.

There is a convenient characterization of efficiency that is useful for proving
Theorem 5.1.

Lemma A.3: A regular estimator with influence function $\psi(z)$ is efficient if and only
if there is a constant matrix $B$ such that $\psi(z) = BS$.

Proof: By eq. (4.2) it follows that $E[\psi(z)S^T] = I$, so that $E[\psi(z)\psi(z)^T]-(E[SS^T])^{-1} =
E[\psi(z)\psi(z)^T]-E[\psi(z)S^T](E[SS^T])^{-1}E[S\psi(z)^T] = \min_B E[\{\psi(z)-BS\}\{\psi(z)-BS\}^T] = 0$ if and only
if there is a $B$ such that $\psi(z) = BS$. QED.

Proof  of  Theorem  5.1:     Let     q(x)  =  (x-ji)' A(x-jli).     Note  that     d/i(q)    /dq  + 
h(q)~ll{q)~ldh{q)/dq  =  1.     Then  by  eq.    (5.1), 

(A.14)  £(x)  =  o-'^vJG^vMx),     F(x)  =  -[-pMH^.OlAtx-M), 

where $I$ is an identity matrix with dimension equal to the number of elements of $x_{12}$.
Also, by eq. (4.2) $E[\ell(x)^Tut] = 0$ for all $t \in \mathcal{T}$. Note that $\sigma^{-2}(x)u\alpha(v) \in \mathcal{T}$ for any
bounded function $\alpha(v)$, since it has finite mean-square (by $E[\sigma^{-2}(x)]$ finite) and
$E[u\{\sigma^{-2}(x)\}\alpha(v)u|x] = \alpha(v)\sigma^{-2}(x)E[u^2|x] = \alpha(v)$. Thus, $0 = E[\ell(x)^Tu\{\sigma^{-2}(x)\}\alpha(v)u] =
E[\ell(x)^T\alpha(v)] = E[E[\ell(x)|v]^T\alpha(v)]$ for any bounded $\alpha(v)$, implying $E[\ell(x)|v] = 0$ (by
choosing $\alpha(v)$ to approximate $E[\ell(x)|v]$ in mean-square). Hence, $0 = E[\ell(x)|v] =
\sigma^{-2}(v)G_v(v)E[F(x)|v]$, so that $E[F(x)|v] = 0$ by $\sigma^{-2}(v)G_v(v) \ne 0$. In particular,
$E[F(x)v^T] = 0$. Furthermore, $[-\beta,I]$, $[I,0]$, and $\Lambda$ all having full row rank and
nonsingularity of $\mathrm{Var}(x)$ imply $\mathrm{Var}(F(x))$ is nonsingular. Then since $\tilde x = (x_{12}^T,v^T)^T$
is a nonsingular linear combination of $x$, it follows that $F(x)$ is a linear
combination of $x-E[x]$ that is orthogonal to $v$ with nonsingular variance. Standard
linear regression arguments then imply that there is a nonsingular matrix $D$ such that
$DF(x)$ is the residual vector from the population linear regression of $x_{12}$ on $v$.
Linearity of conditional expectations for spherically symmetric distributions then gives
$DF(x) = x_{12}-E[x_{12}|v]$, so that by eq. (A.14), $\sigma^2(x) = \sigma^2(v)$, Lemma A.3 gives the
conclusion as a result of

$D\psi(z) = D\ell(x)u = \sigma^{-2}(v)G_v(v)DF(x)u = \sigma^{-2}(v)G_v(v)\{x_{12}-E[x_{12}|v]\}u$
$= \sigma^{-2}(v)G_v(v)\{x_{12}-E[\sigma^{-2}(v)x_{12}|v]/E[\sigma^{-2}(v)|v]\}u = S$.  QED.


Proof  of  Lemma  5.2:     As  noted  in  the  text,     5,/S.     is  an  estimator  of     0       in  the 

partial  index  model     g(x)  =  G(x11+Xlk^k'x12 xl  k-l,Xl  k+1 ^q'3^'     with 

influence  function     L  (x)u.     Then  by  the  argument  following  eq.   (A. 14),     the  conditional 
expectation  of     ^(x)     given  the  arguments  of     G     is  zero.     Furthermore,  the  arguments  of 
G     are  a  nonsingular  linear  combination  of     X,  ,     giving  the  conclusion.     QED. 


Proof of Theorem 5.3: Let $w(x)$ equal the efficient weight, and without changing
notation let $w(x) = \sigma^2(v)G_v(v)^{-1}w(x)$. Then by eq. (5.1) and Lemma A.3 there is a matrix
$B$ such that $S = B\ell(x)u$, so by $\mathrm{Prob}(u \ne 0) = 1$,

(A.15)  $\ell(x) = B\{x_{12}-E[x_{12}|v]\}$.

Let $x_j = (x_{12})_j$ and $x_{-j} = (x_{12})_{-j}$. Equation (A.15) implies $\ell_j(x) = b_j(x_j-E[x_j|v]) +
b_{-j}^T(x_{-j}-E[x_{-j}|v])$ for constant scalar $b_j$ and vector $b_{-j}$. Also, $b_j \ne 0$, because
either $b_j$ or $b_{-j}$ is not zero (because nonsingularity of $B$ follows from
nonsingularity of $E[SS^T]$) and if $b_j = 0$, nonzero $b_{-j}^T(x_{-j}-E[x_{-j}|v])$ (again implied
by finiteness of the variance bound for the index model) contradicts Lemma 5.1. Then,
dividing by $b_j$, we obtain

$E[x_j|v,x_{-j}] = E[x_j|v] + (-b_{-j}^T/b_j)\{x_{-j}-E[x_{-j}|v]\}$.  QED.

Proof of Theorem 5.4: By equation (4.6), $V^{-1} - H'\Omega^{-1}H = E[SS^T] -
E[S\psi(z)^T](E[\psi(z)\psi(z)^T])^{-1}E[\psi(z)S^T] = \min_\Pi E[\{S-\Pi\psi(z)\}\{S-\Pi\psi(z)\}^T]$, and the other statements
follow as immediate consequences.
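
To illustrate the quantity $(H'\Omega^{-1}H)^{-1}$ appearing here, the following Python sketch computes the minimum chi-square combination of $J$ estimators of a common scalar coefficient. Taking the parameter to be scalar (so that $H$ is a vector of ones), the function names, and the numerical covariance matrix are all assumptions made for the example.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def min_chi_square(omega):
    """Optimal combination weights and variance (H' Omega^{-1} H)^{-1}, H = ones."""
    J = len(omega)
    coef = solve(omega, [1.0] * J)   # Omega^{-1} H
    precision = sum(coef)            # H' Omega^{-1} H
    weights = [c / precision for c in coef]
    return weights, 1.0 / precision


# Hypothetical asymptotic covariance matrix of two estimators of the same coefficient.
omega = [[2.0, 0.5], [0.5, 1.0]]
weights, variance = min_chi_square(omega)
```

In this example the combined variance is 0.875, below the variance of either estimator alone; adding further estimators can only lower it, which is the sense in which the minimum chi-square precision approaches the efficient information as the spanning condition is met.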


Proof of Theorem 5.5: It suffices to prove the result with $w_0(v) = 1$, since $w_0(v)$
factors out of each $\ell_j(x)$. By $\sigma^2(x)$ bounded away from zero, $E[\|\ell^*(x)\|^2]$ is finite.
Note first that for any vector $b(x)$, $E[\|b(x)u\|^2] \le CE[\|b(x)\|^2]$, so by Theorem 5.4 it
suffices to show existence of square matrices $\Pi_{jJ}$ such that $E[\|\ell^*(x)-\sum_{j=1}^J\Pi_{jJ}\ell_j(x)\|^2]
\to 0$ as $J \to \infty$. Also, by the Hilbert space fact cited in the text, each element of

i  (x)     is  in  the  closure  of     20, ©•••©£,,      ,,     for     £„.    =  {I:  E[i\x  ,1  =  0,  E[£2]  <  a.}. 

21  2,q-l  2k  -k 

Thus,   it  suffices  to  show  that  for  each     k     and     a(x)  e  £„,      there  are     n        such  that 
(A.  16)  E[\a(cc)-lJ  n    i. ,(x)|2]  -*  0. 

J— 1    JJ   JK 

Also,  by     x     bounded,  polynomials  in     X     are  dense  in     £»,     while 
E[{a(x)-{p(a:)-E[p(a;)|a:_.]»2]  =  E[Var(a(x)-p(<E)|a:_.)]  s  Var(a(x)-p(<c))  £  E[{a(x)-p(a;)>2], 

J  J 

so  that  the  set     {p(x)-E[p(ffi)  \x_.h  p(x)     is  a  polynomial)     is  dense  in     1^.     Therefore, 
it  suffices  to  show  that  eq.   (A.16)  is  satisfied  where     a(x)  =  p(x)-E[p(x)  \  x_  ]     for  a 


polynomial     p(x).     It  follows  by  Assumption  5.1  that     b(x)  =  a(x)  - 

x 
f(x)~2df(x)/dxA  ka(a:k(t,a;))f(a^(t,x))dt     is  continuous,  for     b  e  S     such  that     x^  >  b 

^  -1        f* 

on  the  support  of  x.     Note  that  a(x)  =  b(x)  +  fix)    df(x)/dxA      b(a^(t,a;))dt,  so  that 


b 


for  the     7T.       of  Assumption  5.2, 
JJ 
lateJ-E.^ir.^MI   *   |b(<c)  -  J].^1ir.jap.k(<c)/a<ckl   +   |f(<c)  W (xVda^  | 


I J  k[b(a^(t,a;))  -  ^it.jap    (a^ft.a))^]^! 
£  (1  +   |f(x)"13f(a:)/axk|max;r|iEk-b|)e  *  C(l  +   |f(x)_1af(a:)/axkl  )e. 
Therefore, by $\epsilon$ arbitrarily small, it follows that eq. (A.16) is satisfied. QED.